[QFJ-917] Race condition between SocketAcceptorIoProcessor & QFJ Message Processor Created: 08/Mar/17 Updated: 27/Jul/17 Resolved: 29/Mar/17 |
|
Status: | Closed |
Project: | QuickFIX/J |
Component/s: | Build |
Affects Version/s: | 1.5.3 |
Fix Version/s: | 1.6.4 |
Type: | Bug | Priority: | Default |
Reporter: | Paul Rowe | Assignee: | Paul Rowe |
Resolution: | Fixed | Votes: | 0 |
Labels: | logout, session | ||
Environment: |
Linux |
Description |
We experienced a race condition where Application.onLogout() method was not called. On a Session, we issue a logout from another thread. The counterparty will then respond with a logout message and disconnect. When this happens, there are two threads in play, the "QFJ Message Processor" thread that processes the counterparty logout message, and another "SocketAcceptorIoProcessor" thread spawned by Mina when the socket disconnects. Both of these threads ultimately fall into the Session.disconnect(String,boolean) method. The sequence of events go like this in Session.disconnect():
I took a look at Can you please confirm if this is a current bug in the code and/or if it has been fixed by another jira that I did not see ? |
Comments |
Comment by Paul Rowe [ 08/Mar/17 ] |
Relevant logs [2017-02-28 22:30:00,118Z][INFO][QFJ Message Processor:0x19] Received logout response [2017-02-28 22:30:00,119Z][INFO][SocketAcceptorIoProcessor-138.0:0x19] [FIX.4.4:TS->CLIENT] Disconnecting: IO Session closed [2017-02-28 22:30:00,127Z][INFO][QFJ Message Processor:0x19] Already disconnected: Received logout response It is a very subtle race condition that happens in the disconnect method. Just after the "SocketAcceptorIoProcessor" thread releases the responderSync lock, a context switch happens to the "QFJ Message Processor" thread. The "QFJ Message Processor" thread will see there is no responder, and returns through the finally block clearing the session state flags. After the return, there is a context switch back to the "SocketAcceptorIoProcessor" thread where it attempts to read the logon state flags, but they have been cleared by the other thread. The trick is getting a context switch to happen just after the responderSync synchronized block. |
Comment by Christoph John [ 08/Mar/17 ] |
Hi Paul Rowe, there was a pull request just recently being created that should work around this issue: https://github.com/quickfix-j/quickfixj/pull/102 Cheers, |
Comment by Paul Rowe [ 08/Mar/17 ] |
Hi Chris, thanks for the response. I reviewed the pull request. The thing that I am currently struggling with now is a unit test. Any unit tests for this laying around ? |
Comment by Christoph John [ 08/Mar/17 ] |
Hi Paul Rowe, I don't have a unit test ready. However, after browsing the existing ones it might be reproducible by the same technique as used in SessionResetTest. There is a PausableThreadPoolExecutor to which you can submit two threads. Each of the threads should then call Session.disconnect() concurrently. IMHO this should exhibit the problematic behaviour (maybe not always, though). Cheers, |
Comment by Paul Rowe [ 08/Mar/17 ] |
OK I will see if I can whip one up and if so I will also submit a pull request. Thanks. |
Comment by Christoph John [ 08/Mar/17 ] |
Great, thanks. |
Comment by Paul Rowe [ 09/Mar/17 ] |
I think the pull request generated by another QF user had issues (I commented on them there), so I generated my own pull request and submitted it to the QF team for review. |