[QFJ-489] ResendRequest even the MsgSeqNum is correct when a lot of messages arrived at the same instance Created: 04/Dec/09 Updated: 15/Nov/12 Resolved: 05/Apr/10 |
|
Status: | Closed |
Project: | QuickFIX/J |
Component/s: | Engine |
Affects Version/s: | 1.4.0 |
Fix Version/s: | None |
Type: | Bug | Priority: | Critical |
Reporter: | lcryst | Assignee: | Unassigned |
Resolution: | Cannot Reproduce | Votes: | 0 |
Labels: | None | ||
Environment: |
WINXP |
Description |
When a lot of messages arrive at the same instance, Quickfix seems to ingore some message (they can be found in the logs), e.g. 34=2, 34=3, 34=4 has arrived at 20091203-08:47:53 [ i use the screenLogFactory] then the MsgSeqNum will be screwed up, keep sending resend request when more messages arrived. How to prevent quickfix from ignoring the message with a smaller MsgSeqNum? |
Comments |
Comment by Steve Bate [ 15/Jan/10 ] |
QFJ might not be ignoring the message. It's possible the message is corrupt and the engine incoming sequence number is not incremented. You don't mention whether you searched the application logs for any log messages related to message validation issues. It's not likely to be a race condition for a couple of reasons. QFJ has been tested under high load without any message loss and messages for a specific session are processed in sequence. |
Comment by robert engels [ 13/Jun/11 ] |
This issue needs to be re-opened, as we are having the exact same problem. Here is a snippet from the log that clearly demonstrates the issue: <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, incoming> (8=FIX.4.29=24035=b34=1195369=119752=20110606-07:29:22.81949=CME50=G56=EKY2Z7N57=BFHS143=US,IL58=Quote rejected: Not authorized to quote this instrument 'SenderComp EKY2Z7/ TraderID UNKNOWN for group ZE'117=101297=5300=91028=Y9772=09774=ZE10=015) <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, incoming> (8=FIX.4.29=24035=b34=1196369=119752=20110606-07:29:22.82049=CME50=G56=EKY2Z7N57=BFHS143=US,IL58=Quote rejected: Not authorized to quote this instrument 'SenderComp EKY2Z7/ TraderID UNKNOWN for group U$'117=102297=5300=91028=Y9772=09774=U$10=189) <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, error> (MsgSeqNum too high, expecting 1195 but received 1196) <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, outgoing> (8=FIX.4.29=9135=234=119849=EKY2Z7N50=BFHS52=20110606-07:29:22.82656=CME57=G142=US,IL7=119516=010=092) <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, event> (Sent ResendRequest FROM: 1195 TO: 0) <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, event> (Processing queued message: 1196) <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, event> (ResendRequest for messages FROM 1195 TO 1195 has been satisfied.) <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, incoming> (8=FIX.4.29=27135=b34=1195369=119852=20110606-07:29:22.83043=Y49=CME50=G56=EKY2Z7N57=BFHS143=US,IL122=20110606-07:29:22.81958=Quote rejected: Not authorized to quote this instrument 'SenderComp EKY2Z7/ TraderID UNKNOWN for group ZE'117=101297=5300=91028=Y9772=09774=ZE10=021) <20110606-07:29:22, FIX.4.2:EKY2Z7N/BFHS/US,IL->CME/G, incoming> (8=FIX.4.29=27135=b34=1196369=119852=20110606-07:29:22.83143=Y49=CME50=G56=EKY2Z7N57=BFHS143=US,IL122=20110606-07:29:22.82058=Quote rejected: Not authorized to quote this instrument 'SenderComp EKY2Z7/ TraderID UNKNOWN for group U$'117=102297=5300=91028=Y9772=09774=U$10=196) This snippet clearly shows that message 1196 came in before 1195, even though the log shows 1195 first. Here's what happens I believe: 1. 1195 comes in and is logged, not yet on received queue We are using a SocketInitiator (which uses the SingleThreaded reader). If it is not a bug, can you explain what we could possibly be doing in our code to cause this behavior? Or what we might try to diagnose the problem? |
Comment by robert engels [ 13/Jun/11 ] |
I think the block() method in the SingleThreadedEventHandlingStrategy needs to be synchronized. Otherwise if multiple threads call block(), you can process message out of order. For example: Thread A is in block(), calls getMessage() - context switch The messages are now processed out of order. These context switches are more app to happen in a heavily stressed system, with lots of active threads (which is what the original reporter implies). Now, I don't think an outside thread should be calling block(), but it is public an it appears that some library level housekeeping code does. Could it be a timer type operation that is causing the problems? Why is block() even exposed? But you also have the same problem if two calls to blockInThread() are made, as there will be multiple reader threads, and that is a problem... Synchronizing block() would fix both of these cases, and is simple, but I am not sure it is the cause of the problem, or the correct solution. |
Comment by Steve Bate [ 13/Jun/11 ] |
The block method is exposed because it's part of the original QuickFIX JNI API. Some users apparently process their messages in the thread that calls block. A method being public doesn't imply it is thread safe and in this case it should never be called from multiple threads. The start method should also never be called from multiple threads. Do you think this is happening in your application? As a simple check, look for multiple QFJ Message Processor threads. This would happen if start were called from multiple threads. You can also do a thread dump and look for multiple occurrences of threads with the block method in the stack trace. If this is the problem, we could add a guard in the start method to throw an exception if it is called from multiple threads. |
Comment by robert engels [ 13/Jun/11 ] |
I have done just that and I see only a single processor thread - at least in standard usage. I am afraid there may be a case, whether our code, or in the library, where there is some timeout condition (that gets triggered under heavy stress), and multiple threads are created then. I've looked through our code, and can see no place where this could occur, but I'll admit that it is somewhat complex. The issue is that we never saw this with 1.3.0, only after upgrading to 1.5.0. I'm afraid it has something to do with session handling/identification and the new method in 1.3.1+ is a contributing factor. Regardless, since the block() is really internal, I don't think there is any performance problem of making it synchronized (since there should only be a single thread in it anyway!), and it would guard against processing messages out of order in ANY case. |