[QFJ-448] Skipped heartbeats Created: 30/Jun/09 Updated: 02/Apr/15 Resolved: 03/Apr/14 |
|
Status: | Closed |
Project: | QuickFIX/J |
Component/s: | Engine |
Affects Version/s: | 1.3.1, 1.3.2, 1.3.3, 1.4.0 |
Fix Version/s: | 1.6.0 |
Type: | Bug | Priority: | Default |
Reporter: | Scott Harrington | Assignee: | Christoph John |
Resolution: | Fixed | Votes: | 0 |
Labels: | None |
Attachments: | QFJ-448.patch |
Description |
A race condition is causing occasional skipped heartbeats. The likelihood is increased by (a) multiple sessions, (b) slow message store, (c) frequent heartbeats e.g. HeartBtInt=1. The "QFJ Timer" thread wakes up every second and looks at all sessions to see who needs a HeartBeat to be sent. The SystemTime.currentTimeMillis() is read at the moment the outgoing message header is created. Suppose you have a "quiet" session that is only exchanging heartbeats, and one or more other "chatty" sessions that don't always need a Heartbeat to be sent because of other traffic being sent. Suppose at time T, all the sessions get a heartbeat, but as the heartbeats are sent by the single QFJ Timer thread, the last session's HeartBeat is sent at time T + 1 ms. This delay is a function of the CPU and MessageStore speed, not any network timings. Then suppose at time T + 1000, only the "quiet" session needs a heartbeat. But SessionState.isHeartBeatNeeded() only sees millisSinceLastSentTime = 999, so no heart beat is sent, and the counterparty has to send a Test Request to see if we're alive. Of course, we are alive and the TEST exchange occurs fine, but we look sloppy. There are already HeartBtInt "fudge factors" of 1.5 in isTestRequestNeeded() and 2.4 in isTimedOut(), I propose to introduce a 10 ms "leeway" value to the isHeartBeatNeeded() method. This fixes my problem. A patch against svn r923 is attached. |
Comments |
Comment by Christoph John [ 03/Apr/14 ] |
Comment by Glyn Walters [ 29/Aug/14 ] |
We have applied this patch to a branch of 1.5.3 we're using locally while awaiting release of 1.6.0. There is another factor to the accuracy of the heartbeat timings when intervals are short though, e.g. 4 seconds with Reuters. If the SessionTImerTask happens to be the one that calls next() on Session when a heartbeat is due, then hbeats are accurate for that period. But if the message processor is the one to call next() when a hbeat happens to be due, then the timings can go off for quite a while. The SessionTimerTask interval starts to get out of sync with when the next heartbeat period is due, and usually will then be triggered again by another incoming message, rather than the timer. With incoming messages coming at around the same frequency as the hbeat interval (because of received hbeats). The 'owner' thread of the current hbeat interval (i.e. the one most likely to trigger it) switches from time to time between SessionTimerTask and the Message Processor thread. Applying this patch mitigated some of the big gaps we could sometimes see. But if more accuracy is desired, this is the cause of some sub second jitter. Two examples: 20140829-12:49:26.809 INFO [QFJ Message Processor] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9819749=Sender52=20140829-11:49:26.80956=TR MATCHING57=FXM142=Target10=082 2/ When the SessionTImerTask triggered 20140829-13:26:47.727 INFO [QFJ Timer] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9875749=Sender52=20140829-12:26:47.72756=TR MATCHING57=FXM142=Target10=082 |
Comment by Christoph John [ 29/Aug/14 ] |
Hi Glyn, when you say "triggered by an incoming message", I guess you are mostly referring to an application message. Incoming messages also go through the fromApp and fromAdmin callbacks. Depending on your implementation there sometimes is much time spent inside those callbacks. And the heartbeat is only triggered after the callback has been completed. This might be the reason for some of the higher delays (I see about 70-90 ms delay in your example). Cheers, |
Comment by Glyn Walters [ 29/Aug/14 ] |
Hi Christopher. Thanks for the response. I think you might have misunderstood me though. It's because either thread calls next(), it makes the baseline time for the heartbeat unreliable. Say for example next() is called by QFJ Message Processor (caused by an incoming heartbeat app message) at 14:21:06.101 and a heartbeat is sent in the same millisecond. The next heartbeat due time is now 14:21:10.101. If the SessionTimerTask Executor happens to have its scheduled tasks firing at .050ms every second, the next time next() is called where state.isHeartBeatNeeded() is going to return true, is either at 14.21.11.050 (almost 1 second late), or some random time between 14.21.10.101 and then if another message happens to be processed in that time. If only the SessionTimerTask triggered calls to state.isHeartBeatNeeded(), then the calls (especially with this tolerance fix) would be very accurate. Because of the mathematics around how that plays with a 4 second hbeat from both sides and a one second SessionTimerTask interval, in a real world situation you see logs where the heartbeat timings are quite jittery (~1 second variance) when the inbound message processing queue has been the cause of the next() call which spawned a heartbeat. The examples I gave indicate periods when we see either inaccurate hbeat timings, just because of the statistical likelihood that once next() has been called, resulting in a hbeat, from an incoming message, that pattern will predominate for a while. Whearas the other example, from later on in the same sesson, is the pattern you get when the SessionTimerTask calls next() at the right time. It also then will predominate, but with much more accurate timings. Thanks |
Comment by Glyn Walters [ 29/Aug/14 ] |
Apologies, for spelling your name wrong! |
Comment by Christoph John [ 02/Sep/14 ] |
Hi Glyn, never mind. |