[QFJ-660] Sequence number file can become corrupted. Created: 11/Jan/12 Updated: 24/Aug/12 Resolved: 24/Aug/12 |
|
Status: | Resolved |
Project: | QuickFIX/J |
Component/s: | Engine |
Affects Version/s: | 1.5.2 |
Fix Version/s: | 1.5.3 |
Type: | Bug | Priority: | Default |
Reporter: | Christopher Hurst | Assignee: | Christoph John |
Resolution: | Fixed | Votes: | 0 |
Labels: | None | ||
Environment: |
Windows, QFJ under heavy load (performance test) |
Description |
I keep meaning to investigate this ... I've seen this a few times and reported it on the forum with no response, basically it appears possible to end up with two sets of sequence numbers in the seqnum file (I assume that's not intentional). I suspect if the engine were to fail at the exact time the bug occurs the current sequence numbers could be wrong ?, this should be very, very unlikely though potentially serious. So Expected seqnum file : corrupted file .. The session does not fail with such a corrupted file or show ill effect, I suspect possible session recovery would be the only issue and as the file can still be read only at the exact point the corruption occurred. The only explanation I can think of is .. |
Comments |
Comment by Christoph John [ 12/Jan/12 ] |
Do you use the SingleThreadedEventHandlingStrategy or ThreadPerSessionEventHandlingStrategy? But one possible cause for this might be a concurrent call to one of the next...() methods and sendRaw(), i.e. messages are sent and received at the same time so that there are concurrent calls to storeSequenceNumbers(). But this is only speculation, I have not checked thoroughly if this is possible at all or if this is prevented by some means. |
Comment by Christopher Hurst [ 12/Jan/12 ] |
SingleThreadedEventHandlingStrategy .. I believe but I don't think that's relevant as the issue is in the engine threading ?? I think I have got to the bottom of the issue and it looks a lot more serious (unless I'm missing something ), I'm not sure the memory cache is consistent with the filestore, I would have thought changes to persistence numbers should be atomic (one complete change AND values published across threads 'visibility') but that doesn't appear to be the case. I think we may have a 'race' here also ... I've trapped the two threads in memory (I vaguely guessed what they would be but wanted to check my facts), on is essentially the timer thread e.g. heart beats the other effectively business messages. The following is the stack traces from two threads trying to manipulate sequence numbers at the same time If some one else can't fault my logic the priority of this might need raising, should be a very easy fix i.e. synchronization Daemon Thread [QFJ Timer] (Suspended (breakpoint at line 403 in FileStore)) Daemon Thread [QFJ Message Processor] (Suspended (breakpoint at line 403 in FileStore)) ScheduledThreadPoolExecutor$ScheduledFutureTask<V>.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask) line: 98 |
Comment by Christoph John [ 12/Jan/12 ] |
Thanks. This supports my assumption from my previous comment: "concurrent call to one of the next...() methods and sendRaw()". |
Comment by Christoph John [ 16/Jan/12 ] |
As a workaround, maybe QFJ-552 helps for the time being. |
Comment by Christoph John [ 24/Aug/12 ] |
The implementation will now be changed to use separate files for sender and target seqnums to avoid concurrent writes of SessionState.set/incrNextTargetMsgSeqNum and SessionState.set/incrNextSenderMsgSeqNum to the same file. Of course this means that anyone accessing the sequence number files directly (NOT via MessageStore methods) will have a problem now. |
Comment by Christoph John [ 24/Aug/12 ] |
NB: The change means that sequence number files from an older version cannot be read with this version. So you shouldn't switch the QF/J version during the day and it's safest to delete the message store beforehand. |