Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TIMEOUT error reported by data_service_request_bidirectional call (no flip, no restarts, pure I/O) #87

Open
yamingk opened this issue May 17, 2024 · 2 comments
Labels
enhancement New feature or request
Milestone

Comments

@yamingk
Copy link
Contributor

yamingk commented May 17, 2024

After running I/O for awhile, TIMEOUT error can be reported by data_service_request_bidirectional while fetching data, this is not expected as there is no flip, no restarts involved in any replica, just pure I/O:

This is seen with real drives, not file.

[05/15/24 13:38:29-07:00] [E] [test_raft_repl_dev_replica_2] [1574] [raft_repl_dev.cpp:515] [rd=rdev1] Not able to fetching data from originator=1251922335, error=-2, probably originator is down. Will retry when new leader start appending log entries
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64118], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=917505 count=2 chunk=181},] remote_blkid=[{blk#=737284 count=2 chunk=619},]] error=TIMEOUT
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64119], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=425986 count=2 chunk=182},] remote_blkid=[{blk#=1032193 count=2 chunk=620},]] error=TIMEOUT
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64120], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=671751 count=4 chunk=183},] remote_blkid=[{blk#=737281 count=4 chunk=621},]] error=TIMEOUT
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64123], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=327683 count=1 chunk=184},] remote_blkid=[{blk#=753666 count=1 chunk=624},]] error=TIMEOUT
[05/15/24 13:38:35-07:00] [E] [test_raft_repl_dev_replica_2] [1626] [raft_repl_dev.cpp:692] [rd=rdev1] Raft Channel: Error in processing rreq=[repl_key=[server=1251922335, term=1, dsn=64121], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=0 m_keysize=0 is_proposer=false local_blkid=[{blk#=704515 count=5 chunk=185},] remote_blkid=[{blk#=376838 count=5 chunk=622},]] error=TIMEOUT
@yamingk yamingk added this to the MileStone4.3 milestone May 17, 2024
@yamingk yamingk added the enhancement New feature or request label May 17, 2024
@szmyd
Copy link
Collaborator

szmyd commented May 31, 2024

Why is this opened in the nuraft_mesg project and not homestore which is where the issue seems to be according to the logs? If there's some raft issue; then we need raft logs...not homestore logs which would indicate a bug on the library user, not the library itself. Not saying it is, but that's the only thing left here to look at...

@yamingk
Copy link
Contributor Author

yamingk commented Jun 3, 2024

It is return immediately at line:506. And we need to look into why mesg_factory::data_service_request_bidirectional is returning -2 when there is no reboot in between. We don't know where the root cause is yet without deeper investigation. It could be in upper layer or lower layer caused this problem.

 505     group_msg_service()
 506         ->data_service_request_bidirectional(
 507             originator, FETCH_DATA,
 508             sisl::io_blob_list_t{
 509                 sisl::io_blob{builder->GetBufferPointer(), builder->GetSize(), false /* is_aligned */}})
 510         .via(&folly::InlineExecutor::instance())
 511         .thenValue([this, builder, rreqs = std::move(rreqs)](auto response) {
 512             if (!response) {
 513                 // if we are here, it means the original who sent the log entries are down.
 514                 // we need to handle error and when the other member becomes leader, it will resend the log entries;
 515                 RD_LOG(ERROR,
 516                        "Not able to fetching data from originator={}, error={}, probably originator is down. Will "

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants