[QFJ-880] qfj doesn't send the next batch when ResendRequestChunkSize > 0 Created: 26/Feb/16  Updated: 10/Oct/18

Status: Open
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.6.0
Fix Version/s: None

Type: Other Priority: Default
Reporter: Xiaojun Zhang Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: CME, resend
Environment:

FIX4.2



 Description   

QuickFIX/J seems to send only the first resend batch: after receiving a SequenceReset it does not request the next one.

e.g.
ResendRequestChunkSize=10

15:28:37.142 [QFJ Message Processor] - fromAdmin PNPBBBN: 8=FIX.4.29=10735=134=44749=CME50=G52=20160225-20:28:37.22356=PNPBBBN57=DROPCOPY143=US,NY369=45621112=IDiky5miuo10=025

15:28:37.142 [QFJ Message Processor] - toAdmin PNPBBBN: 8=FIX.4.29=10735=034=4562249=PNPBBBN50=DropCopy52=20160225-20:28:37.14256=CME57=G142=US,NY143=CME112=IDiky5miuo10=004

I manually reduced the incoming seq no to 428
15:29:07.969 [Thread-30] - command: "fix PNPBBBN 428"

15:29:08.120 [QFJ Message Processor] - toAdmin PNPBBBN: 8=FIX.4.29=10535=234=4562449=PNPBBBN50=DropCopy52=20160225-20:29:08.12056=CME57=G142=US,NY143=CME7=42816=43710=188

15:29:08.148 [QFJ Message Processor] - fromAdmin PNPBBBN: 8=FIX.4.29=13635=434=42843=Y49=CME50=G52=20160225-20:29:08.25556=PNPBBBN57=DROPCOPY122=20160225-20:29:08.255143=US,NY369=4562436=449123=Y10=001

I expected QuickFIX/J to send a ResendRequest for the next batch (438-447), but it sent a normal heartbeat instead.

15:29:38.704 [QFJ Timer] - toAdmin PNPBBBN: 8=FIX.4.29=9235=034=4562549=PNPBBBN50=DropCopy52=20160225-20:29:38.70456=CME57=G142=US,NY143=CME10=069



 Comments   
Comment by Christoph John [ 26/Feb/16 ]

Sorry, this is a little hard to follow. I don't know about the command "fix PNPBBBN". Is this some custom stuff that you are executing? Actually, I would expect QFJ to send a Logout if it received too low a sequence number. The last log line is half an hour later than the log line before. What happened in between?

Actually, there are some unit tests around the chunked resend requests behaviour. But of course, this does not mean that there could be no bugs inside that portion of the code. But I cannot really follow your example. Do you have a unit test or some more concise steps which can be followed?

Comment by Xiaojun Zhang [ 26/Feb/16 ]

Sorry the log is a bit hard to read.
1. "fix PNPBBBN" is an admin command I created which calls session.setNextTargetMsgSeqNum() so I can manipulate the seq no to trigger the scenario.
2. The log started from 15:28:37 to 15:29:38. You can ignore the UTC time in the FIX message.
3. A simpler example is
1) both sides seq no started from 1.
2) for some reason the client didn't receive some messages
3) the next message the client received was seq no 20
4) chunk size was set to 10, so the client sent a resend request begin=2 end=11
5) the client received a seq reset with GapFill=Y
6) client should send the next resend request begin=2 end=20 but it didn't

Hope this helps.

Comment by Xiaojun Zhang [ 26/Feb/16 ]

Sorry there was a typo in 6) client should send the next resend request begin=12 end=20 but it didn't

Comment by Christoph John [ 26/Feb/16 ]

If you look at the incoming sequence reset you see the field 36/NewSeqNo set to 449. So that is why there are no more messages resent. This should actually also be a message in your event log. Something along the lines of: "received sequence reset to 449."
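To make the fields of that SequenceReset easier to read, here is a minimal Java sketch that splits the message into tag/value pairs. The `|` delimiters are an assumption: the log above lost the SOH separators, so the field boundaries are reconstructed by hand. The class name `SequenceResetFields` is hypothetical and not part of QuickFIX/J.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SequenceResetFields {
    /** Splits a FIX message whose SOH delimiters were replaced by '|' into tag/value pairs. */
    static Map<Integer, String> parse(String message) {
        Map<Integer, String> fields = new LinkedHashMap<>();
        for (String field : message.split("\\|")) {
            if (field.isEmpty()) {
                continue;
            }
            int eq = field.indexOf('=');
            fields.put(Integer.parseInt(field.substring(0, eq)), field.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // The SequenceReset from the description, with assumed field boundaries.
        String sequenceReset = "8=FIX.4.2|9=136|35=4|34=428|43=Y|49=CME|50=G|"
                + "52=20160225-20:29:08.255|56=PNPBBBN|57=DROPCOPY|"
                + "122=20160225-20:29:08.255|143=US,NY|369=45624|36=449|123=Y|10=001|";
        Map<Integer, String> f = parse(sequenceReset);
        System.out.println("MsgType(35)=" + f.get(35));
        System.out.println("MsgSeqNum(34)=" + f.get(34));
        System.out.println("NewSeqNo(36)=" + f.get(36));
        System.out.println("GapFillFlag(123)=" + f.get(123));
    }
}
```

Read this way, the gap fill starts at MsgSeqNum 428 and moves the expected incoming sequence number straight to 449, which is past the requested chunk 428-437.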

Comment by Xiaojun Zhang [ 26/Feb/16 ]

Hmm, I am reading the FIX doc:
"The message in all situations specifies NewSeqNo <36> to reset as the value of the next sequence number immediately following the messages and/or sequence numbers being skipped."
It seems <36> just indicates the next seq no I will receive. Does it mean the whole gap has been filled? What should <36> be in order for the next resend request to be sent?

Actually I was connecting to CME drop copy and encountered this issue.

Comment by Christoph John [ 26/Feb/16 ]

If you send a resend request and the counterparty sends SequenceReset with NewSeqNo = 449 then QFJ will continue with that incoming seqnum. That means that the gap has been filled up to that seqnum, yes. Probably there were only session messages in between, or messages that CME does not want to resend.

Comment by Christoph John [ 26/Feb/16 ]

NB: I think this belongs more onto the mailing list or with CME support than here. It's no bug after all.

Comment by Xiaojun Zhang [ 27/Feb/16 ]

Sorry I still don't quite understand the logic here.
Let's say the seq gap is 1-20 and chunk size is set to 10. Application messages are in 11-20 but NOT in 1-10. After QFJ sends the 1st resend request from 1 to 10 what NewSeqNo should be in SequenceReset?
I don't think it will be 11 because seq no shouldn't decrease.
If it is 21 QFJ will continue with 21 and lose all messages from 11 to 20.

If my understanding is not correct can you please give me an example how chunked resend should work? Appreciate your help.

Comment by Xiaojun Zhang [ 01/Mar/16 ]

If you don't think it is a bug I will add code in my Application class to override the logic. I just want to point out that current logic does NOT work for CME iLink and Drop Copy 4.0 which place a 2500 limit on resend request. QFJ would lose all messages after the first batch. It makes more sense to assume the corresponding chunk gap has been filled based on the sequence reset, not the whole gap.

Comment by Christoph John [ 01/Mar/16 ]

Sorry, I totally forgot about this one. What logic do you want to override? Actually, you do not need to override any of the session logic.

"It seems <36> just indicates the next seq no I will receive. Does it mean the whole gap has been filled? In order to send the next resend request what <36> should be?"

The NewSeqNo on a sequence reset is the next-to-be sequence number of the other communication side.

"Let's say the seq gap is 1-20 and chunk size is set to 10. Application messages are in 11-20 but NOT in 1-10. After QFJ sends the 1st resend request from 1 to 10 what NewSeqNo should be in SequenceReset?"

NewSeqNo should be 11.

"I don't think it will be 11 because seq no shouldn't decrease.
If it is 21 QFJ will continue with 21 and lose all messages from 11 to 20."

Why does the seq no decrease? It is as follows: you connect to the counterparty. Your expected target seq no is 1. The other side comes in with 20. Now you send a resend request from 1-10.
You receive a SeqReset with NewSeqNo 11. Now your expected target seq no is 11. That is an increase from 1, not a decrease.
You issue another ResendRequest for 11-20 and receive the app messages, each will increment your expected target seq no by 1.
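The walkthrough above can be replayed with a small standalone Java sketch. It only models the receiver's expected target sequence number; the class `ChunkedResendWalkthrough` and its helpers are hypothetical names, not QuickFIX/J classes, and 20 is treated as the end of the requested range, as in the example.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedResendWalkthrough {
    /** End of the next chunk: at most chunkSize messages, never past rangeEnd. */
    static int chunkEnd(int begin, int chunkSize, int rangeEnd) {
        return Math.min(begin + chunkSize - 1, rangeEnd);
    }

    /** New expected target seq no after a SequenceReset; it may only move forward. */
    static int applySequenceReset(int expected, int newSeqNo) {
        if (newSeqNo < expected) {
            throw new IllegalArgumentException(
                    "NewSeqNo " + newSeqNo + " < expected " + expected);
        }
        return newSeqNo;
    }

    /** Replays the example (gap 1-20, chunk size 10) and returns a trace of the steps. */
    static List<String> replay() {
        List<String> trace = new ArrayList<>();
        int expected = 1;
        int rangeEnd = 20;
        int chunkSize = 10;

        // First chunk: no application messages in 1-10, so the gap is filled.
        int end = chunkEnd(expected, chunkSize, rangeEnd);
        trace.add("ResendRequest " + expected + "-" + end);
        expected = applySequenceReset(expected, 11);
        trace.add("SequenceReset NewSeqNo=11, expected=" + expected);

        // Second chunk: application messages 11-20 arrive one by one.
        end = chunkEnd(expected, chunkSize, rangeEnd);
        trace.add("ResendRequest " + expected + "-" + end);
        for (int seq = expected; seq <= end; seq++) {
            expected = seq + 1;
        }
        trace.add("Application messages received, expected=" + expected);
        return trace;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

If the counterparty answered the first request with NewSeqNo 21 instead of 11, `applySequenceReset(1, 21)` would move the expected number past the whole range in one step, which is the situation the reporter is worried about.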

I know that there are several users who use QFJ to connect to CME. The last issue around chunked resend requests that I remember was QFJ-751, which was fixed in QF/J 1.6.0.

Please write a mail to the mailing list if you have further questions. https://lists.sourceforge.net/lists/listinfo/quickfixj-users
Or please attach a failing unit test so that I can reproduce the error.

Thanks.

Comment by John [ 07/Apr/16 ]

I can confirm that I am seeing the same behavior when connecting to the CME's Drop Copy 4.0. It appears this issue arises due to the way in which the CME increments tag 36 of SequenceReset messages and the way QuickFIX/J handles the sending of ResendRequest messages.

The following code is an excerpt from the Session class in the quickfix package, which I believe is where the logic breaks down. It appears that QuickFIX/J sends ResendRequest chunks depending on when it receives incoming SequenceReset messages. The code below requires the value of the newSequence variable to be less than the EndSeqNo of the range of messages that the client is supposed to fetch. When the CME sends a NewSeqNo beyond that range, this check fails and some ResendRequest messages are never sent.

Would it be safe to remove this condition from the if statement, allowing subsequent ResendRequests to be sent to the counterparty even though the newSequence value is past the range defined in the range variable? If so, the code would also need to stop using the newSequence value to define the range of sequence numbers in the subsequent ResendRequest.

newSequence < range.getEndSeqNo()

private void nextSequenceReset(Message sequenceReset) throws IOException, RejectLogon,
        FieldNotFound, IncorrectDataFormat, IncorrectTagValue, UnsupportedMessageType {
    boolean isGapFill = false;
    if (sequenceReset.isSetField(GapFillFlag.FIELD)) {
        isGapFill = sequenceReset.getBoolean(GapFillFlag.FIELD) && validateSequenceNumbers;
    }

    if (!verify(sequenceReset, isGapFill, isGapFill)) {
        return;
    }

    if (validateSequenceNumbers && sequenceReset.isSetField(NewSeqNo.FIELD)) {
        final int newSequence = sequenceReset.getInt(NewSeqNo.FIELD);

        getLog().onEvent(
                "Received SequenceReset FROM: " + getExpectedTargetNum() + " TO: "
                        + newSequence);
        if (newSequence > getExpectedTargetNum()) {
            state.setNextTargetMsgSeqNum(newSequence);
            final ResendRange range = state.getResendRange();
            if (range.isChunkedResendRequest()) {
                if (newSequence >= range.getCurrentEndSeqNo()
                        && newSequence < range.getEndSeqNo()) {
                    // If new seq no is beyond the range of the current chunk
                    // and if we are not done with all resend chunks,
                    // we send out a ResendRequest at once.
                    // Alternatively, we could also wait for the next incoming message
                    // which would trigger another resend.
                    final String beginString = sequenceReset.getHeader().getString(
                            BeginString.FIELD);
                    sendResendRequest(beginString, range.getEndSeqNo() + 1,
                            newSequence + 1, range.getEndSeqNo());
                }
            }
            // QFJ-728: newSequence will be the seqnum of the next message so we
            // delete all older messages from the queue since they are effectively skipped.
            state.dequeueMessagesUpTo(newSequence);
        } else if (newSequence < getExpectedTargetNum()) {
            getLog().onErrorEvent(
                    "Invalid SequenceReset: newSequence=" + newSequence + " < expected="
                            + getExpectedTargetNum());
            if (resetOrDisconnectIfRequired(sequenceReset)) {
                return;
            }
            generateReject(sequenceReset, SessionRejectReason.VALUE_IS_INCORRECT,
                    NewSeqNo.FIELD);
        }
    }
}

Comment by Xiaojun Zhang [ 07/Apr/16 ]

John, I received your email. Do you have a non-personal email address? My corporate email doesn't allow sending to gmail, hotmail, etc.

Comment by Christoph John [ 08/Apr/16 ]

I very briefly read through http://www.cmegroup.com/confluence/display/EPICSANDBOX/Drop+Copy+Session+Layer+-+Resend+Request and I do not see anything that is contradictory to how QFJ interprets the NewSeqNo tag. However, it seems that CME is implementing something very custom. A test case would definitely help here.
One thing I noticed though, is that they mention "duplicate resend requests". Maybe you can try if the configuration SendRedundantResendRequests=Y helps in that case?

Comment by John [ 09/Apr/16 ]

Christoph,

What format would you prefer for a test case? I could provide the FIX message log, which might make it easier to see how the CME increases the sequence numbers in the SequenceReset messages it sends between ResendRequest chunks. Those values go past the end of the original ResendRequest range that QuickFIX/J defines after logging on, which causes some of the ResendRequest chunks never to be sent to the exchange.

As I mentioned in my earlier post, I believe the issue is in the code snippet I posted above. Upon receiving a SequenceReset message, the nextSequenceReset method of the Session class decides whether a new ResendRequest chunk needs to be sent by checking newSequence < range.getEndSeqNo(). When this check fails, the remaining ResendRequests are never sent.

Say the defined ResendRequest range is 1-10000.
1) I send a ResendRequest for messages 1-2500. I receive these messages and then receive a SequenceReset to 2520 for example.
2) I then send a ResendRequest for messages 2501-5000. I receive these messages and then receive a SequenceReset to 5050.
3) I send a ResendRequest for messages 5001-7500. I receive these messages and then receive a SequenceReset to 11000.
At this point the newSequence < range.getEndSeqNo() test fails and QuickFIX/J never sends the final ResendRequest for messages 7501-10000. The FIX connection continues to receive real-time trades etc as per normal.
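The three steps can be checked against the condition quoted from nextSequenceReset with a standalone Java sketch. This is only a model of that one condition: the class `ChunkConditionCheck` is a hypothetical name, and the chunk ends (2500, 5000, 7500) are taken from the scenario above rather than from QuickFIX/J's internal state.

```java
final class ChunkConditionCheck {
    /** Mirrors the condition quoted from nextSequenceReset; not the QuickFIX/J class itself. */
    static boolean wouldRequestNextChunk(int newSequence, int currentEndSeqNo, int endSeqNo) {
        return newSequence >= currentEndSeqNo && newSequence < endSeqNo;
    }

    public static void main(String[] args) {
        int endSeqNo = 10000; // end of the whole ResendRequest range
        // Each row: {end of the chunk just requested, NewSeqNo of the SequenceReset received}
        int[][] steps = {{2500, 2520}, {5000, 5050}, {7500, 11000}};
        for (int[] step : steps) {
            boolean next = wouldRequestNextChunk(step[1], step[0], endSeqNo);
            System.out.println("chunk end " + step[0] + ", NewSeqNo " + step[1]
                    + " -> next ResendRequest sent: " + next);
        }
    }
}
```

For the first two steps the condition holds. For the third, 11000 is not less than 10000, so under this condition no request for 7501-10000 would follow.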

I would assume this to be custom logic that the CME is implementing on their end. I think the difference comes down to the fact that QuickFIX/J appears to rely more on the newSequence number that is received on an incoming SequenceReset message to dictate whether further ResendRequest chunks will be sent out and the CME may send out a SequenceReset message taking you past the originally defined EndSeqNo range.

I am not a FIX expert, so I can't say whether this solution would work, but the logic for sending ResendRequest chunks could be changed so that all of the chunk ranges are computed as soon as the need for a ResendRequest arises and are put into a queue (or some other object). QuickFIX/J could then pop each ResendRequest off the queue once it has received all of the messages for the previous one from the exchange, and so decouple itself from newSequence. This might ensure that none of the ResendRequest chunks gets skipped. It might not be the standard way of communicating via FIX, though, so it might not be a feasible solution.
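A minimal Java sketch of the queue idea, assuming the full range and chunk size are known when the gap is detected. `ResendChunkQueue` is a hypothetical class, not an existing QuickFIX/J type, and the sketch leaves out how the engine would decide that a chunk has been fully received.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ResendChunkQueue {
    // Each element is {beginSeqNo, endSeqNo} of one ResendRequest still to be sent.
    private final Deque<int[]> pending = new ArrayDeque<>();

    /** Precomputes every chunk of the range [begin, end] up front. */
    ResendChunkQueue(int begin, int end, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        for (int b = begin; b <= end; b += chunkSize) {
            pending.add(new int[] {b, Math.min(b + chunkSize - 1, end)});
        }
    }

    /** Returns the next chunk to request, or null once every chunk has been handed out. */
    int[] next() {
        return pending.poll();
    }

    int remaining() {
        return pending.size();
    }

    public static void main(String[] args) {
        ResendChunkQueue queue = new ResendChunkQueue(1, 10000, 2500);
        int[] chunk;
        while ((chunk = queue.next()) != null) {
            // In a real engine this would happen only after the previous chunk completed.
            System.out.println("ResendRequest " + chunk[0] + "-" + chunk[1]);
        }
    }
}
```

Because the chunks are fixed in advance, a SequenceReset whose NewSeqNo lies beyond the original range would not by itself remove the 7501-10000 request from the queue. Whether sending that request is still correct FIX behaviour once the counterparty has gap-filled past it is a separate question.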

It would be really great if this functionality could be modified in QuickFIX/J, as it is an extremely useful engine. For users attempting to use QuickFIX/J with CME's Drop Copy 4, however, this case will cause problems in a recovery scenario.

Comment by Christoph John [ 29/Apr/16 ]

Hi John,

yes, a message log would be a good starting point. From this I should hopefully be able to construct an automated test case.

Thanks, Chris.

Comment by Christoph John [ 26/Aug/16 ]

Hi John, Xiaojun Zhang: is there still the need to support this? Do you have any message logs?
Thanks, Chris.

Comment by John [ 01/Sep/16 ]

Hi Christoph John,

I can confirm that there is a need to support this functionality. To be clear, I was experiencing this issue when working with the 1.6.0 release of QuickFIX/J and have not tested with more recent releases. I don't have any message logs available currently aside from what Xiaojun posted earlier in this issue.

Thanks

Comment by Christoph John [ 01/Sep/16 ]

Hi John, OK, if this functionality was implemented could you test it against CME to verify if it works? Do they have some sort of certification test?

Comment by Dmitry Razumov [ 10/Oct/18 ]

Hey John
I know it's been two years, but did you pass the certification in the end?

Generated at Sat Nov 23 00:37:11 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.