Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection Broken when ConnectionOptions::socket_timeout is set #49

Closed
dellorogiulio opened this issue Mar 20, 2020 · 30 comments
Closed

Comments

@dellorogiulio
Copy link

dellorogiulio commented Mar 20, 2020

Hi, I'm have a problem with the Subscriber class when setting a ConnectionOptions::socket_timeout.
Basic example:

int main(int argc, char const *argv[])
{
	using namespace sw::redis;
	using namespace std::chrono_literals;
	ConnectionOptions connection_option;
	connection_option.host = "127.0.0.1";
	connection_option.port = 6379;
	// connection_option.socket_timeout = 10ms; /* comment out to obtain the error */

	auto redis = Redis(connection_option);
	auto sub = redis.subscriber();
	sub.on_message([](std::string channel, std::string msg) {
		std::cout << "Get " << msg << " on " << channel << std::endl;
	});
	sub.subscribe("topic_1");
	while (redis.ping() == "PONG")
	{
		try
		{
			sub.consume();
		}
		catch (const TimeoutError &timeout)
		{
			std::cout << "TIMEOUT" << std::endl;
			continue;
		}
		catch (const Error &err)
		{
			std::cout << err.what() << std::endl;
		}
		std::cout << "Pass" << std::endl;
		std::this_thread::sleep_for(100ms);
	}
	std::cout << "Exit" << std::endl;
	return 0;
}

All test are done voluntarily with no one publishing on that topics.
Without the explicit set of socket_timeout, all works as expected:

  • a first pass is write
  • connection is blocked since there is no incoming messages.

With the explicit set of socket_timeout (comment out the specific line), the output is:

  • Pass
  • Failed to get reply: Unknown error code
  • Pass
  • Connection is broken
  • Pass
  • Connection is broken
    [so on and so forth]

NOTE: the connection is broken after some seconds
In this case I would expect to see a TIMEOUT print, but no one is printed.
In my case, this behavior is 100% repeatable.
This error could be related to this modification? Am I missing something?

Thanks in advance!

@sewenew
Copy link
Owner

sewenew commented Mar 20, 2020

Hi @dellorogiulio

I'm sorry, but this is a bug. The new version of hiredis has a new error code: REDIS_ERR_TIMEOUT, while redis-plus-plus doesn't handle it correctly. I've fixed this problem, and the code has been merged into the timeout branch. You can try it.

This branch also fixed the other issue you opened earlier.

Thanks a lot for finding these bugs! If you still have any problem, feel free to let me know :)

Regards

@dellorogiulio
Copy link
Author

Hi, thanks for the response!
I have no problem to try the timeout branch. However is it stable? I'm working to include your library in a big project and I need absolute stability.
Thanks

@sewenew
Copy link
Owner

sewenew commented Mar 20, 2020

Yes, this branch is stable, and pass all tests.

I still need to add some code to ensure that this fix can work with older version of hiredis. When I finish that (this weekend), I'll merge the timeout branch into master. And the master branch is always stable.

However, since I don't have MinGW environment, I cannot run tests on it. Could you please run these tests on MinGW, and give me some feedback? Thanks a lot!

Regards

@dellorogiulio
Copy link
Author

dellorogiulio commented Mar 20, 2020

I could try to run timeout-branch tests this weekend. If you could notify me when you will merge into master, I will port the new master into my project (I performed some changes in the files structure to meet my project requirements) and re-try the above code.
Thanks! We stay in touch

OT: why do you decide to define callbacks with string passed by value and not reference? Ok for ownership (we could change the copied value) but, from my point of view, this is a lost of performance. Consider the case when we pass "stringify" values, such as profobuf or msgpack; these generally need a deserialization to obtain the desire object which generally makes a copy of the passed string. Why not decide to pass all by const reference and avoid the user to directly modify the string?
Obviously it just a personal opinion and I would appreciate more details.
Thanks!

@dellorogiulio
Copy link
Author

dellorogiulio commented Mar 21, 2020

I port your timeout branch changes into my code. Running this example now the Connection Broken problem seems ok!
However, I notice something strange: consider this example:

int main(int argc, char const *argv[])
{
	using namespace sw::redis;
	ConnectionOptions connection_option;
	connection_option.host = "127.0.0.1";
	connection_option.port = 6379;
	connection_option.socket_timeout = std::chrono::milliseconds(10);

	auto redis = Redis(connection_option);
	auto sub = redis.subscriber();
	sub.on_message([](std::string channel, std::string msg) {
		std::cout << "Get " << msg << " on " << channel << std::endl;
	});
	sub.subscribe("topic_1");
	while (redis.ping() == "PONG")
	{
                auto start_consume = std::chrono::high_resolution_clock::now();
		try
		{
			sub.consume();
		}
		catch (const TimeoutError &timeout)
		{
			auto timeout_time = std::chrono::high_resolution_clock::now();
			auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timeout_time - start_consume).count();
			std::cout << "TIMEOUT after " << duration << std::endl;
			continue;
		}
		catch (const Error &err)
		{
			std::cout << err.what() << std::endl;
		}
		std::cout << "Pass" << std::endl;
		std::this_thread::sleep_for(100ms);
	}
	std::cout << "Exit" << std::endl;
	return 0;
}

OUTPUT:

  • PASS
  • TIMEOUT after 10016
  • TIMEOUT after 10000
  • TIMEOUT after 10021

It seems that the timeout fires always after 10 seconds and not milliseconds.
I tried to set socket_timeout = std::chrono::milliseconds(1); and, in this case, the timeout fires after 1 second.
I try to debug the path of the socket_timeout and all seems good to me: the actual milliseconds value is correctly set to the corresponding timeval and then pass to setsockopt in the hiredis library.
Did you ever notice this type of behavior?

MINGW TEST:

Currently I was able to run only these tests since I'm under windows and I have Redis 4.x.
As I know, there is no Radis 5.x stable for windows (but I could be wrong), so I cannot go further.

Testing Redis...
Pass sanity tests
Pass connection commands tests
Pass keys commands tests
Pass string commands tests
Pass list commands tests
Pass hash commands tests
Pass set commands tests
Test failed: ERR unknown command `ZPOPMAX`, with args beginning with: `{sw::redis::test}::zpop`, `1`,

@sewenew
Copy link
Owner

sewenew commented Mar 22, 2020

Hi @dellorogiulio

The REDIS_ERR_TIMEOUT problem has been fixed, and the code has been merged into the master branch. It passed all tests. It's stable, and you can depend on it.

I also did some research on this new error code. It seems hiredis only reports this error (connection timeout) on Windows platform. On Linux or macOS it still uses the old way to report timeout, i.e. REDIS_ERR_IO with errno setting to EAGAIN/EWOULDBLOCK. Now this fix can work with the older version and latest version of hiredis.

Thanks again for your work!

Regards

@sewenew
Copy link
Owner

sewenew commented Mar 22, 2020

Ok for ownership (we could change the copied value) but, from my point of view, this is a lost of performance.

There's no lost of performance, except an extra string move constructor, which is ignorable:

void Subscriber::_handle_message(redisReply &reply) {
    // other code...
    auto msg = reply::parse<std::string>(*msg_reply);
    _msg_callback(std::move(channel), std::move(msg));    // <--- msg is moved into the callback, no copy.
}

Consider the case when we pass "stringify" values, such as profobuf or msgpack; these generally need a deserialization to obtain the desire object which generally makes a copy of the passed string.

In this case, why we need to makes a copy of the passed string? You can just pass the received message to the parser of protobuf. Or I miss something?

sub.on_message([](std::string channel, std::string msg) {
		ProtoMsg msg;
                msg.ParseFromString(msg);         // <---- there's no copy, but pass it by reference to ParseFromString
	});

Also, if you store protobuf into Redis, you can try redis-protobuf module, which can avoid serialization/deserialization on the client side, in some case. However, in the pub/sub case, you cannot avoid serialization/deserialization, since we can only publish or subscribe string values.

why do you decide to define callbacks with string passed by value and not reference?
....
Why not decide to pass all by const reference and avoid the user to directly modify the string?

Because sometimes we need a copy of the message, for example:

  • save the message to some container, and process a batch of messages
  • use the message to construct a struct, which has a string data member
  • modify the message, e.g. make all characters lowercase.
  • and so on...

In these cases, if the message is passed by const references, the user code has to make a copy of the message. Instead, if the message is passed by value, user code can move the message to avoid a copy.

Also in the callback, since the message is owned by the user code, user can decide whether to modify it or not.

Regards

@sewenew
Copy link
Owner

sewenew commented Mar 22, 2020

Did you ever notice this type of behavior?

No, I didn't meet this problem on Linux or macOS. I run your program, and the output is:

Pass
TIMEOUT after 11
TIMEOUT after 10
TIMEOUT after 10

I took a look at the hiredis code. It seems that it's a hiredis bug:

in the net.c file, it convert timeval to a DWORD, i.e. milliseconds, and calls setsockopt.

int redisContextSetTimeout(redisContext *c, const struct timeval tv) {
    const void *to_ptr = &tv;
    size_t to_sz = sizeof(tv);
#ifdef _WIN32
    DWORD timeout_msec = tv.tv_sec * 1000 + tv.tv_usec / 1000;
    to_ptr = &timeout_msec;
    to_sz = sizeof(timeout_msec);
#endif
    if (setsockopt(c->fd,SOL_SOCKET,SO_RCVTIMEO,to_ptr,to_sz) == -1) {
        __redisSetErrorFromErrno(c,REDIS_ERR_IO,"setsockopt(SO_RCVTIMEO)");
        return REDIS_ERR;
    }
    // other code...

This seems right. However, net.c includes sockcompat.h, which defines setsockopt to win32_setsockopt:

#define setsockopt(sockfd, level, optname, optval, optlen) win32_setsockopt(sockfd, level, optname, optval, optlen)

In sockcompat.c, win32_setsockopt does another conversion for timeval from timeval to DWORD. And I think this is the problem.

int win32_setsockopt(SOCKET sockfd, int level, int optname, const void *optval, socklen_t optlen) {
    int ret = 0;
    if ((level == SOL_SOCKET) && ((optname == SO_RCVTIMEO) || (optname == SO_SNDTIMEO))) {
        struct timeval *tv = optval;
        DWORD timeout = tv->tv_sec * 1000 + tv->tv_usec / 1000;
        ret = setsockopt(sockfd, level, optname, (const char*)&timeout, sizeof(DWORD));
    } else {
        ret = setsockopt(sockfd, level, optname, (const char*)optval, optlen);
    }
    _updateErrno(ret != SOCKET_ERROR);
    return ret != SOCKET_ERROR ? ret : -1;
}

You can try the code, i.e. convert timeval to DWORD twice, in MINGW64, to see whether the 10ms has been changed to 10s.

MINGW TEST

You can just comment out the _test_zpoprelated test code in zset_cmds_test.hpp, and the stream related test code in stream_cmds_test.hpp to run other tests.

Regards

@sewenew sewenew closed this as completed Mar 22, 2020
@sewenew
Copy link
Owner

sewenew commented Mar 22, 2020

Hi @dellorogiulio

Sorry, I closed this issue by accident. Now I reopen it.

And please keep it open, until we solve all problems.

Sorry again.

Regards

@sewenew sewenew reopened this Mar 22, 2020
@sewenew
Copy link
Owner

sewenew commented Mar 22, 2020

I run the following test code on macOs by defining DWORD to unsigned long (I'm not sure about the exact definition of DWORD on MINGW64), it does change 10ms to 10s:

int main() {
    timeval tv;
    tv.tv_sec = 0;
    tv.tv_usec = 10 * 1000;

    void *to_ptr = &tv;
    size_t to_sz = sizeof(tv);

    using DWORD = unsigned long;
    DWORD timeout_msec = tv.tv_sec * 1000 + tv.tv_usec / 1000;
    to_ptr = &timeout_msec;

    timeval *tv1 = (timeval*)to_ptr;
    DWORD timeout = tv1->tv_sec * 1000 + tv1->tv_usec / 1000;

    cout << timeout << endl;

    return 0;
}

OUTPUT: 10000

Maybe we can open an issue or PR for hiredis.

Regards

@dellorogiulio
Copy link
Author

Hi @sewenew ,
We are discussing about too many arguments in this issue. Let me clean this up and move the MINGW Test in another "issues".
Thanks for the response about subscriber callback, it's all clear now!

Damn! Yesterday I look in that specific code and I miss the bug! Your definition of DWORD as unsigned long seems correct. But, as you clearly highlight, the error is not in the conversion from DWORD to timeval, but the cast from a DWORD* to timeval*!

Let me summarize the problem so that I can open the Issue on the hiredis repo:

net.c

316    int redisContextSetTimeout(redisContext *c, const struct timeval tv) {
317        const void *to_ptr = &tv;
318        size_t to_sz = sizeof(tv);
319    #ifdef _WIN32
320        DWORD timeout_msec = tv.tv_sec * 1000 + tv.tv_usec / 1000;
321        to_ptr = &timeout_msec;
322       to_sz = sizeof(timeout_msec);
323    #endif
324        if (setsockopt(c->fd,SOL_SOCKET,SO_RCVTIMEO,to_ptr,to_sz) == -1) {
325            __redisSetErrorFromErrno(c,REDIS_ERR_IO,"setsockopt(SO_RCVTIMEO)");
326            return REDIS_ERR;
327        }
328        if (setsockopt(c->fd,SOL_SOCKET,SO_SNDTIMEO,to_ptr,to_sz) == -1) {
329            __redisSetErrorFromErrno(c,REDIS_ERR_IO,"setsockopt(SO_SNDTIMEO)");
330            return REDIS_ERR;
331        }
332       return REDIS_OK;
333    }

Line 320 a DWORD timeout_msec is correctly computed from the timeval tv
Line 321 to_ptr is assign to the address of timeout_msec (to_ptr is di facto a DWORD*)
Line 324 to_ptr is pass to setsockopt which is win32_setsockoption

sockcompat.c

212    int win32_setsockopt(SOCKET sockfd, int level, int optname, const void *optval, socklen_t optlen) {
213        int ret = 0;
214        if ((level == SOL_SOCKET) && ((optname == SO_RCVTIMEO) || (optname == SO_SNDTIMEO))) {
215            struct timeval *tv = optval;
216            DWORD timeout = tv->tv_sec * 1000 + tv->tv_usec / 1000;
217            ret = setsockopt(sockfd, level, optname, (const char*)&timeout, sizeof(DWORD));
218        } else {
219            ret = setsockopt(sockfd, level, optname, (const char*)optval, optlen);
220        }
221        _updateErrno(ret != SOCKET_ERROR);
222        return ret != SOCKET_ERROR ? ret : -1;
223    }

Line 215 casts optval (which is a DWORD*) to timeval* and this obviously causes the error.

I don't understand why the #ifdef _WIN32 was introduced at line 320, In fact, commenting out line 320, 321, 322, 323 all seems work as expected! Now the timeout is called after the correct amount of milliseconds!

@sewenew
Copy link
Owner

sewenew commented Mar 23, 2020

Nice work!

@dellorogiulio
Copy link
Author

Just to keep trace: now hiredis will be update soon fixing this bug (he simply removed lines 320, 321, 322, 323 from net.c)
He also add some WIndows test too! Which are always welcome!

@labar90
Copy link

labar90 commented Sep 13, 2021

Hello, I´m having the same bad behavior running the same example in the version 1.3.1 with TLS connection options. If I run the example without TLS it runs as expected but if I run with TLS then I have this output:

  • Pass
  • Failed to get reply: Unknown error code
  • Pass
  • Connection is broken
  • Pass
  • Connection is broken

Im using msvc 2015 environment.

Thank you in advance.

@sewenew
Copy link
Owner

sewenew commented Sep 13, 2021

@labar90 I don't have a Windows box nearby, however, I tried the code with TLS on Linux, and cannot reproduce the problem.

Also, please check the version of redis-plus-plus you used, the error message you gave in the comment, i.e. Failed to get reply: Unknown error code, has already been modified when I fixed this issue. It seems that you used an old version of redis-plus-plus. Please try the latest code to see if the problem still exists.

Regards

@labar90
Copy link

labar90 commented Sep 13, 2021

Yes, you are right. I copied the output from the previous message thinking it´s exactly the same output but my output is:

  • Pass
  • Failed to get reply: No error
  • Pass
  • Connection is broken
  • Pass
  • Connection is broken

@sewenew
Copy link
Owner

sewenew commented Sep 13, 2021

@Iabar90 What's version of hiredis did you use? The No error message should be reported by hiredis. However, I did some search with hiredis-1.0.0, and did not find No error (case matters) in the hiredis code.

Regards

@labar90
Copy link

labar90 commented Sep 13, 2021

@sewenew I am using hiredis v1.0.0

Regards.

@sewenew
Copy link
Owner

sewenew commented Sep 14, 2021

@Iabar90 I created a "debug" branch, which add some more error message when throwing error. Can you try the code on this debug branch, and give me the updated error message? So that I can do some debug on this problem. Thanks.

Regards

@labar90
Copy link

labar90 commented Sep 14, 2021

Hello @sewenew, I have tested the debug branch.
This is the output:

  • Pass
  • Failed to get reply: No error, 1, 0
  • Pass
  • Connection is broken
  • Pass
  • Connection is broken
  • Pass

@sewenew
Copy link
Owner

sewenew commented Sep 14, 2021

@Iabar90 Thanks for the info! It seems a bug of hiredis, which returns an IO error while the errno is 0. This should not happen. It seems that hiredis doesn't handle the SSL_read error properly. I'll do more digging.

// hiredis/ssl.c
        int err = SSL_get_error(rssl->ssl, nread);
        if (c->flags & REDIS_BLOCK) {     // <------------this might be the problem, even in blocking mode, we should check the returned err from SSL_read
            if (errno == EINTR) {
                return 0;
            } else {
                const char *msg = NULL;
                if (errno == EAGAIN) {
                    msg = "Resource temporarily unavailable";
                }
                __redisSetError(c, REDIS_ERR_IO, msg);
                return -1;
            }
        }

        /**
         * We can very well get an EWOULDBLOCK/EAGAIN, however
         */
        if (maybeCheckWant(rssl, err)) {

Regrads

@labar90
Copy link

labar90 commented Sep 14, 2021

Thank you so much @sewenew

@sewenew
Copy link
Owner

sewenew commented Sep 15, 2021

@Iabar90 I add some debug info for hiredis. However, the code works fine on both Linux and macOS. I cannot reproduce your problem. Can you try the modified hiredis in your Windows env? The modified code is on the master branch of https://github.com/sewenew/hiredis

Please ensure that you uninstall the previously installed hiredis before install this modified version. If there're multiple versions of hiredis install, you'll get some wired problems.

When you run the code, please give me the output of the code. It should print some ssl debug info to stderr, and some redis-plus-plus error on stdout. Please give both error messages. Thanks!

Regards

@labar90
Copy link

labar90 commented Sep 16, 2021

Hello @sewenew,
This is the new output:

  • Pass
  • debug ssl error: SSL_ERROR_SYSCALL: 0
  • debug ssl error: in blocked: 0
  • Failed to get reply: No error, 1, 0
  • Pass
  • Connection is broken
  • Pass
  • Connection is broken
  • Pass

Regards and thanks.

@sewenew
Copy link
Owner

sewenew commented Sep 17, 2021

@labar90 I created a new issue to track this problem. I also made a hot fix for hiredis, and you can try this version of hiredis to see if it's solved the problem.

Please check this for detail.

Regards

@labar90
Copy link

labar90 commented Sep 17, 2021

@sewenew I have just checked this new version and it does work. Thank you so much.

@sewenew
Copy link
Owner

sewenew commented Sep 17, 2021

@labar90 According to OpenSSL community, that’s a bug for OpenSSL 1.x. Can you upgrade OpenSSL to 3.0, and try the original version of hiredis? Thanks!

Regards

@labar90
Copy link

labar90 commented Sep 22, 2021

Hello @sewenew I have tested with OpenSSL 3.0 and it has the same wrong behaviour. It just works fine with your branch.

Regards.

@sewenew
Copy link
Owner

sewenew commented Sep 22, 2021

@Iabar90 Can you also try the debug branch of redis-plus-plus I mentioned in the above comments, and hiredis version 1.0.0 (not the hiredis I modified) with OpenSSL 3.0, and give me the output of error messages? So that I can do more debug, and maybe ask the hiredis community to fix it.

Thanks a lot!

Regards

@labar90
Copy link

labar90 commented Sep 26, 2021

Hello @sewenew , I have tested with redis-plus-plus debug and hiredis 1.0.0 and OpenSSL 3.0.

The output is:

Pass
Failed to get reply: No error, 1, 0
Pass
Connection is broken
Pass
Connection is broken
Pass
Connection is broken
Pass

Regards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants