[QFJ-448] Skipped heartbeats Created: 30/Jun/09  Updated: 02/Apr/15  Resolved: 03/Apr/14

Status: Closed
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.3.1, 1.3.2, 1.3.3, 1.4.0
Fix Version/s: 1.6.0

Type: Bug Priority: Default
Reporter: Scott Harrington Assignee: Christoph John
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File QFJ-448.patch    

 Description   

A race condition is causing occasional skipped heartbeats.

The likelihood is increased by (a) multiple sessions, (b) slow message store, (c) frequent heartbeats e.g. HeartBtInt=1.

The "QFJ Timer" thread wakes up every second and looks at all sessions to see who needs a HeartBeat to be sent. The SystemTime.currentTimeMillis() is read at the moment the outgoing message header is created.

Suppose you have a "quiet" session that is only exchanging heartbeats, and one or more other "chatty" sessions that don't always need a Heartbeat to be sent because of other traffic being sent.

Suppose at time T, all the sessions get a heartbeat, but as the heartbeats are sent by the single QFJ Timer thread, the last session's HeartBeat is sent at time T + 1 ms. This delay is a function of the CPU and MessageStore speed, not any network timings.

Then suppose at time T + 1000, only the "quiet" session needs a heartbeat.

But SessionState.isHeartBeatNeeded() only sees millisSinceLastSentTime = 999, so no heart beat is sent, and the counterparty has to send a Test Request to see if we're alive.

Of course, we are alive and the TEST exchange occurs fine, but we look sloppy.

There are already HeartBtInt "fudge factors" of 1.5 in isTestRequestNeeded() and 2.4 in isTimedOut(), I propose to introduce a 10 ms "leeway" value to the isHeartBeatNeeded() method.

This fixes my problem. A patch against svn r923 is attached.



 Comments   
Comment by Christoph John [ 03/Apr/14 ]

http://sourceforge.net/p/quickfixj/code/1171/

Comment by Glyn Walters [ 29/Aug/14 ]

We have applied this patch to a branch of 1.5.3 we're using locally while awaiting release of 1.6.0. There is another factor to the accuracy of the heartbeat timings when intervals are short though, e.g. 4 seconds with Reuters. If the SessionTImerTask happens to be the one that calls next() on Session when a heartbeat is due, then hbeats are accurate for that period. But if the message processor is the one to call next() when a hbeat happens to be due, then the timings can go off for quite a while. The SessionTimerTask interval starts to get out of sync with when the next heartbeat period is due, and usually will then be triggered again by another incoming message, rather than the timer.

With incoming messages coming at around the same frequency as the hbeat interval (because of received hbeats). The 'owner' thread of the current hbeat interval (i.e. the one most likely to trigger it) switches from time to time between SessionTimerTask and the Message Processor thread. Applying this patch mitigated some of the big gaps we could sometimes see. But if more accuracy is desired, this is the cause of some sub second jitter.

Two examples:
1/ When an incoming message triggered a heartbeat

20140829-12:49:26.809 INFO [QFJ Message Processor] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9819749=Sender52=20140829-11:49:26.80956=TR MATCHING57=FXM142=Target10=082
20140829-12:49:30.890 INFO [QFJ Message Processor] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9819849=Sender52=20140829-11:49:30.88956=TR MATCHING57=FXM142=Target10=086
20140829-12:49:34.971 INFO [QFJ Message Processor] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9819949=Sender52=20140829-11:49:34.97156=TR MATCHING57=FXM142=Target10=083
20140829-12:49:39.054 INFO [QFJ Message Processor] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9820049=Sender52=20140829-11:49:39.05456=TR MATCHING57=FXM142=Target10=063
20140829-12:49:43.118 INFO [QFJ Message Processor] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9820149=Sender52=20140829-11:49:43.11856=TR MATCHING57=FXM142=Target10=060
20140829-12:49:47.208 INFO [QFJ Message Processor] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9820249=Sender52=20140829-11:49:47.20856=TR MATCHING57=FXM142=Target10=065

2/ When the SessionTImerTask triggered

20140829-13:26:47.727 INFO [QFJ Timer] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9875749=Sender52=20140829-12:26:47.72756=TR MATCHING57=FXM142=Target10=082
20140829-13:26:51.728 INFO [QFJ Timer] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9875849=Sender52=20140829-12:26:51.72856=TR MATCHING57=FXM142=Target10=079
20140829-13:26:55.727 INFO [QFJ Timer] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9875949=Sender52=20140829-12:26:55.72756=TR MATCHING57=FXM142=Target10=083
20140829-13:26:59.727 INFO [QFJ Timer] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9876049=Sender52=20140829-12:26:59.72756=TR MATCHING57=FXM142=Target10=079
20140829-13:27:03.728 INFO [QFJ Timer] quickfixj.msg.outgoing - FIXT.1.1:Sender/Target->TR MATCHING/FXM: 8=FIXT.1.19=9935=034=9876149=Sender52=20140829-12:27:03.72856=TR MATCHING57=FXM142=Target10=071

Comment by Christoph John [ 29/Aug/14 ]

Hi Glyn,

when you say "triggered by an incoming message", I guess you are mostly referring to an application message. Incoming messages also go through the fromApp and fromAdmin callbacks. Depending on your implementation there sometimes is much time spent inside those callbacks. And the heartbeat is only triggered after the callback has been completed. This might be the reason for some of the higher delays (I see about 70-90 ms delay in your example).

Cheers,
Chris.

Comment by Glyn Walters [ 29/Aug/14 ]

Hi Christopher.

Thanks for the response. I think you might have misunderstood me though. It's because either thread calls next(), it makes the baseline time for the heartbeat unreliable. Say for example next() is called by QFJ Message Processor (caused by an incoming heartbeat app message) at 14:21:06.101 and a heartbeat is sent in the same millisecond. The next heartbeat due time is now 14:21:10.101. If the SessionTimerTask Executor happens to have its scheduled tasks firing at .050ms every second, the next time next() is called where state.isHeartBeatNeeded() is going to return true, is either at 14.21.11.050 (almost 1 second late), or some random time between 14.21.10.101 and then if another message happens to be processed in that time. If only the SessionTimerTask triggered calls to state.isHeartBeatNeeded(), then the calls (especially with this tolerance fix) would be very accurate.

Because of the mathematics around how that plays with a 4 second hbeat from both sides and a one second SessionTimerTask interval, in a real world situation you see logs where the heartbeat timings are quite jittery (~1 second variance) when the inbound message processing queue has been the cause of the next() call which spawned a heartbeat. The examples I gave indicate periods when we see either inaccurate hbeat timings, just because of the statistical likelihood that once next() has been called, resulting in a hbeat, from an incoming message, that pattern will predominate for a while. Whearas the other example, from later on in the same sesson, is the pattern you get when the SessionTimerTask calls next() at the right time. It also then will predominate, but with much more accurate timings.

Thanks
Glyn

Comment by Glyn Walters [ 29/Aug/14 ]

Apologies, for spelling your name wrong!

Comment by Christoph John [ 02/Sep/14 ]

Hi Glyn, never mind.
Feel free to open a JIRA ticket for that topic. Maybe it is the best to do as you suggested, i.e. only the SessionTimerTask should trigger heartbeats. Maybe it is also feasible to generate the heartbeats by a whole new thread/task? But that discussion could be done on the new ticket.

Generated at Sat Nov 23 11:05:26 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.