[QFJ-660] Sequence number file can become corrupted. Created: 11/Jan/12  Updated: 24/Aug/12  Resolved: 24/Aug/12

Status: Resolved
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.5.2
Fix Version/s: 1.5.3

Type: Bug Priority: Default
Reporter: Christopher Hurst Assignee: Christoph John
Resolution: Fixed Votes: 0
Labels: None
Environment:

Windows, QFJ under heavy load (performance test)



 Description   

I keep meaning to investigate this ...

I've seen this a few times and reported it on the forum with no response, basically it appears possible to end up with two sets of sequence numbers in the seqnum file (I assume that's not intentional).

I suspect if the engine were to fail at the exact time the bug occurs the current sequence numbers could be wrong ?, this should be very, very unlikely though potentially serious.

So Expected seqnum file :
<length><Seq 1>:<Seq2>

corrupted file ..
<length1><Seq 1>:<Seq 2><length2><Another Seq 1>:<Another Seq 2>

The session does not fail with such a corrupted file or show ill effect, I suspect possible session recovery would be the only issue and as the file can still be read only at the exact point the corruption occurred.

The only explanation I can think of is ..
I was wondering if this was a multithreading issue with FileStore.storeSequenceNumbers ? I keep meaning to check if this can be called by multiple threads for a single session as that would fit, ie two threads try to move to the start of the file and write at the same time ?



 Comments   
Comment by Christoph John [ 12/Jan/12 ]

Do you use the SingleThreadedEventHandlingStrategy or ThreadPerSessionEventHandlingStrategy?
From what I know, the FileStore is not thread-safe and concurrent calls to e.g. storeSequenceNumbers() could lead to strange results. The sequence numbers can be incremented by various next...() methods in the Session class. However, normally the session should only process one message at once. There is either one thread for all sessions, or a thread per session.

But one possible cause for this might be a concurrent call to one of the next...() methods and sendRaw(), i.e. messages are sent and received at the same time so that there are concurrent calls to storeSequenceNumbers(). But this is only speculation, I have not checked thoroughly if this is possible at all or if this is prevented by some means.

Comment by Christopher Hurst [ 12/Jan/12 ]

SingleThreadedEventHandlingStrategy .. I believe but I don't think that's relevant as the issue is in the engine threading ??

I think I have got to the bottom of the issue and it looks a lot more serious (unless I'm missing something ), I'm not sure the memory cache is consistent with the filestore, I would have thought changes to persistence numbers should be atomic (one complete change AND values published across threads 'visibility') but that doesn't appear to be the case. I think we may have a 'race' here also ...

I've trapped the two threads in memory (I vaguely guessed what they would be but wanted to check my facts), on is essentially the timer thread e.g. heart beats the other effectively business messages.

The following is the stack traces from two threads trying to manipulate sequence numbers at the same time
(to reproduce this just put a break point in the filestore though be careful as obviously this code is time sensitive)

If some one else can't fault my logic the priority of this might need raising, should be a very easy fix i.e. synchronization

Daemon Thread [QFJ Timer] (Suspended (breakpoint at line 403 in FileStore))
FileStore.storeSequenceNumbers() line: 403
FileStore.incrNextSenderMsgSeqNum() line: 288
SessionState.incrNextSenderMsgSeqNum() line: 359
Session.sendRaw(Message, int) line: 2382
Session.generateHeartbeat() line: 1806
Session.next() line: 1779
SessionConnector$SessionTimerTask.run() line: 251
Executors$RunnableAdapter<T>.call() line: 441
FutureTask$Sync.innerRunAndReset() line: 317
ScheduledThreadPoolExecutor$ScheduledFutureTask<V>(FutureTask<V>).runAndReset() line: 150

Daemon Thread [QFJ Message Processor] (Suspended (breakpoint at line 403 in FileStore))
FileStore.storeSequenceNumbers() line: 403
FileStore.setNextTargetMsgSeqNum(int) line: 280
SessionState.setNextTargetMsgSeqNum(int) line: 355
Session.nextSequenceReset(Message) line: 1338
Session.next(Message) line: 943
SingleThreadedEventHandlingStrategy$SessionMessageEvent.processMessage() line: 112
SingleThreadedEventHandlingStrategy.block() line: 75
SingleThreadedEventHandlingStrategy$1.run() line: 92
Thread.run() line: 662

ScheduledThreadPoolExecutor$ScheduledFutureTask<V>.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask) line: 98
ScheduledThreadPoolExecutor$ScheduledFutureTask<V>.runPeriodic() line: 180
ScheduledThreadPoolExecutor$ScheduledFutureTask<V>.run() line: 204
ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
ThreadPoolExecutor$Worker.run() line: 908
Thread.run() line: 662

Comment by Christoph John [ 12/Jan/12 ]

Thanks. This supports my assumption from my previous comment: "concurrent call to one of the next...() methods and sendRaw()".

Comment by Christoph John [ 16/Jan/12 ]

As a workaround, maybe QFJ-552 helps for the time being.
Another workaround would be a custom file store which stores incoming and outgoing seq nums in separate files.

Comment by Christoph John [ 24/Aug/12 ]

The implementation will now be changed to use separate files for sender and target seqnums to avoid concurrent writes of SessionState.set/incrNextTargetMsgSeqNum and SessionState.set/incrNextSenderMsgSeqNum to the same file.

Of course this means that anyone accessing the sequence number files directly (NOT via MessageStore methods) will have a problem now.

Comment by Christoph John [ 24/Aug/12 ]

NB: The change means that sequence number files from an older version cannot be read with this version. So you shouldn't switch the QF/J version during the day and it's safest to delete the message store beforehand.

Generated at Sat Nov 23 09:18:38 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.