Details
Description
A race condition is causing occasional skipped heartbeats.
The likelihood is increased by (a) multiple sessions, (b) a slow message store, and (c) frequent heartbeats, e.g. HeartBtInt=1.
The "QFJ Timer" thread wakes up every second and checks every session to see which ones need a Heartbeat to be sent. SystemTime.currentTimeMillis() is read at the moment the outgoing message header is created.
Suppose you have a "quiet" session that is only exchanging heartbeats, and one or more other "chatty" sessions that don't always need a Heartbeat because they are sending other traffic.
Suppose at time T, all the sessions get a Heartbeat, but because the heartbeats are sent one after another by the single QFJ Timer thread, the last session's Heartbeat is sent at time T + 1 ms. This delay is a function of the CPU and MessageStore speed, not of any network timings.
Then suppose at time T + 1000, only the "quiet" session needs a heartbeat.
But SessionState.isHeartBeatNeeded() only sees millisSinceLastSentTime = 999, which is just short of the 1000 ms interval, so no Heartbeat is sent, and the counterparty has to send a Test Request to see if we're alive.
Of course, we are alive and the Test Request exchange completes fine, but we look sloppy.
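The timing in the scenario above can be reproduced with a small stand-alone sketch. This is a simplified model, not the actual QuickFIX/J source: the class name HeartbeatRaceSketch, the method signature, and the ">=" comparison are assumptions made for illustration, and T is an arbitrary value.

```java
// Simplified model of the scenario above; NOT the real SessionState code.
final class HeartbeatRaceSketch {

    // Assumed strict check: a heartbeat is due only once a full interval has elapsed.
    static boolean isHeartBeatNeeded(long nowMillis, long lastSentMillis, long heartBeatMillis) {
        long millisSinceLastSentTime = nowMillis - lastSentMillis;
        return millisSinceLastSentTime >= heartBeatMillis;
    }

    public static void main(String[] args) {
        long heartBeatMillis = 1000; // HeartBtInt=1
        long t = 50_000;             // arbitrary timer tick T
        long lastSent = t + 1;       // quiet session was served last, 1 ms after T

        // Tick at T + 1000: only 999 ms have elapsed, so the heartbeat is skipped.
        System.out.println(isHeartBeatNeeded(t + 1000, lastSent, heartBeatMillis)); // false

        // Tick at T + 2000: 1999 ms have elapsed, so the heartbeat goes out about a second late.
        System.out.println(isHeartBeatNeeded(t + 2000, lastSent, heartBeatMillis)); // true
    }
}
```

With a one-second timer and HeartBtInt=1, a single millisecond of skew is enough to push the heartbeat to the following tick.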
There are already HeartBtInt "fudge factors" of 1.5 in isTestRequestNeeded() and 2.4 in isTimedOut(). I propose introducing a 10 ms "leeway" value in the isHeartBeatNeeded() method.
This fixes my problem. A patch against svn r923 is attached.
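For readers without the attachment, the idea of the leeway can be sketched roughly as follows. This is not the attached patch: the class name HeartbeatLeewaySketch, the constant name LEEWAY_MILLIS, and the method signature are hypothetical, and only the 10 ms value comes from the proposal above.

```java
// Rough sketch of a leeway-based check; NOT the attached patch or the real SessionState code.
final class HeartbeatLeewaySketch {

    // 10 ms leeway as proposed in this issue; the constant name is hypothetical.
    static final long LEEWAY_MILLIS = 10;

    // A heartbeat is considered due once the elapsed time is within LEEWAY_MILLIS of the interval.
    static boolean isHeartBeatNeeded(long nowMillis, long lastSentMillis, long heartBeatMillis) {
        long millisSinceLastSentTime = nowMillis - lastSentMillis;
        return millisSinceLastSentTime + LEEWAY_MILLIS >= heartBeatMillis;
    }

    public static void main(String[] args) {
        long heartBeatMillis = 1000; // HeartBtInt=1
        long t = 50_000;             // arbitrary timer tick T
        long lastSent = t + 1;       // previous heartbeat stamped 1 ms after T

        // Tick at T + 1000: 999 ms elapsed, which now counts as due thanks to the leeway.
        System.out.println(isHeartBeatNeeded(t + 1000, lastSent, heartBeatMillis)); // true

        // 500 ms after the last send: still well inside the interval, no heartbeat.
        System.out.println(isHeartBeatNeeded(lastSent + 500, lastSent, heartBeatMillis)); // false
    }
}
```

The leeway only absorbs skew up to 10 ms between the timer tick and the recorded send time, so a heartbeat can go out at most 10 ms early.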