[QFJ-999] Issue with infinite replay request Created: 03/Feb/21 Updated: 07/Feb/21 Resolved: 07/Feb/21 |
|
Status: | Closed |
Project: | QuickFIX/J |
Component/s: | None |
Affects Version/s: | 2.1.0 |
Fix Version/s: | None |
Type: | Bug | Priority: | Default |
Reporter: | Dan Marques | Assignee: | Christoph John |
Resolution: | Not a bug | Votes: | 0 |
Labels: | Reconnect, resend | ||
Environment: |
Java 8 / Linux |
Attachments: | issue-with-gw-restart-test.txt |
Description |
I believe I've seen evidence of some sort of issue with infinite replay requests. I've attached an the annotated log of the situation. The log is from "EXCHANGE", which is a QFJ 2.1.0 application; the other party of the session, "BROKER_GATEWAY", is not QFJ. Both applications are on the same host, so the message time stamps correlate. Essentially, what occurs in this scenario is that there is a disconnect - and while the FIX session is disconnected, EXCHANGE generates 3 execution reports (er...10, er...11, and er...12). When the session reconnects, EXCHANGE sends a logon with a sqn 12 that is greater than what BROKER_GATEWAY was expected (it was expecting sqn 11), and follows with sqn 13 (er...11) and sqn 14 (er...12) While EXCHANGE is sending those "queued" messages, BROKER_GATEWAY issues a resend-request, for messages starting at sqn 11 and with an infinite length. EXCHANGE responds by replaying sqns 11 (er...10), 12 (admin msg as a gap-fill), and 13 (er...11), but never replays 14 (er...12) It isn't until 30 seconds later, that another message, a heartbeat, from EXCHANGE with sqn 15 triggers a second replay request, and 14 (er...12) is replayed. It seems as if there is some sort of race, where the resend request arrives after sqn 14 was "triggered" to be sent, but before the internal state used to prepare a resent request has been updated. |
Comments |
Comment by Christoph John [ 05/Feb/21 ] |
Thanks for the detailed and annotated log. I will try to create a unit test with this. |
Comment by Christoph John [ 05/Feb/21 ] |
Please correct me if I get it wrong but here is my interpretation of the log. To me everything looks correct from QFJ perspective. EXCHANGE gets the ResendRequest after sending out seqno 13, so seqno 14 is a message not affected by resend processing. What BROKER_EXCHANGE side should do is queue any messages that they cannot yet process due to seqno-too-high issues. After the resend processing is complete, the other side should process the queued messages. In my opinion only report with seqno 11 from the EXCHANGE was generated during disconnection. If seqno 13 and 14 were also resent from the message store (where they reside when there is a disconnection) they would get sent with 43/PossDup and 122/OrigSendingTime as a result from a ResendRequest. What do you think? |
Comment by Dan Marques [ 05/Feb/21 ] |
I think you are correct about the timing of 13 and 14 (being generated after the reconnect), but I don't think the log shows QFJ behaving correctly. It seems to me that the correct behavior for QFJ, upon receiving a resend request from 11 to infinity, would be to satisfy that request before sending any new messages. So, in our example, if QFJ got the request before sending 14, it should replay 11, 12 and 13, and then send 14. If on the other hand, it got the request after sending 14, it should have replayed 11, 12, 13, and 14. What we see instead, is it sending 14, following by a replay of 11, 12, and 13. To me, this looks like some sort of race condition, where 14 is "being sent' right as the request comes in - and there is race between the sending code updating the "highest-sqn-counter", and the replay response code reading that value.
While queuing 14 would "hide" this bug, it seems like QFJ is imposing behavior on the receiving application which isn't mandated by the FIX 4.2 spec. In fact, it seem to run contrary to the following (FIX 4.2 with Errata 20010501, page 30) :
The quote above says nothing about queuing 8 and 9, and instead says that those messages should be "ignored" and either explicitly (7 - 9) or implicitly (7 - 0) requested for a resend. As the receiver is going to process messages one at a time, when it determines 7 is missing (upon receiving 8), it will not have any knowledge of 9 - it doesn't know if 9 has already been read from the (local) socket, on the wire, in the sender's message store, or yet to be generated by the sender application. This is completely analogous to our example: 11 is missing, and 12-14 are received. The receiver only knows that 11 is missing when it receives 12, it knows nothing yet about 13 and 14. As soon as that gap is detected, it asks for a resend, from 12 - infinity, as per the recommended approach. From you description (i.e. if there isn't a bug), it seems that QFJ treats 13 differently from 14, it resends 13 but not 14, based on when it gets the resend-request. However, it impossible for the receiver to know that QFJ is going to treat those differently - when it requested the resend (after receiving 12) it there is no way for it to know that (from QFJ's point of view) 13 had been sent but 14 hadn't. For the receiver to correctly handle QFJs behavior it seems that it would need to queue both 13 and 14. That seems clearly outside the behavior recommended by the spec. |
Comment by Christoph John [ 06/Feb/21 ] |
Let me first comment on your statements:
Why should QFJ do that? On large message backlogs this could lead to QFJ only satisfying the ResendRequest and queueing all current messages. Instead, new messages could be interleaved with old ones. Of course, the receiving side would need to queue the newer messages until the gap is filled. But this is common practice (see further below in my comment).
This will not work in situations when new messages are generated "at the same time" and concurrent ResendRequest processing is going on. Now to the specification part: And although we are talking about a FIX4.2 connection, the session-level specification has since been refined and clarified. That is why you should always check up session-level stuff in the most current spec. However, even the relevant part I mentioned is in the FIX4.2 spec. Please see page 18 (or search for "queue" )
By the way, there also is a FIX session state matrix which lists various states of the connection. However, it is not contained in the FIX4.2 spec. IIRC it was in the FIX4.4 spec. Or even better, you can check it online here: https://www.fixtrading.org/standards/fix-session-layer-online/#fix-session-state-matrix Relevant parts are 11 and 12:
conclusion By the way, you might want to look at using NextExpectedMsgSeqNum on Logon (if your counterparty supports it), since this will eliminate the additional ResendRequest roundtrip on Logon and hence also the "racy" conditions. However, as I just realized it is only available from FIX4.4 and up. If you still have questions please feel free to ask. Cheers, |
Comment by Dan Marques [ 07/Feb/21 ] |
Chris, Thanks for your response. As to your point:
If your interpretation was what the spec's authors intended, I don't see why they would explicitly state that the receiver should "ask for a resend of 7-9". Under your interpretation, it would need a resend of 7 only, as it could then consume 8 and 9 from the queue. As to the portion of the spec you referenced:
That behavior is explicitly described as "recommended", as opposed to "required" or "mandatory" or some similar language. Therefore an application that does not store those message can still be correct. That said, the second sentence of that quote is illustrative.
What I now think is, if QFJ's behavior is correct with regard to the spec, a connecting application which chose not to store out of sequence messages would need to issue multiple, overlapping, resend requests, until it was caught up. For example, in my original log, upon receiving sqn 14, the receiver application should have issued a second request for 11 - 0. As that would have arrived at the QFJ sender after 14 was sent, it would have initiated a second replay, this time from 11- 14. As per that quote, messages 11, 12 and 13 would have been resent twice - safe, but inefficient. Again, thanks for your time. Dan |
Comment by Christoph John [ 07/Feb/21 ] |
Hi Dan, the reason why a ResendRequest for a single message should be avoided is explained in the next sentence in the spec:
So yes, in this part the spec does not explicitly state that the message should be queued but rather focuses on how to avoid race conditions between two parties which both are doing gap fills. You are right that the other FIX engine still works in this case. As I said the other FIX engine should be corrected and not must be corrected. And as I also said the situation corrects itself after the next message. Do you think that this issue can be closed? Thanks, |
Comment by Dan Marques [ 07/Feb/21 ] |
Chris, You can close it. Thanks. Dan |