[QFJ-917] Race condition between SocketAcceptorIoProcessor & QFJ Message Processor Created: 08/Mar/17  Updated: 27/Jul/17  Resolved: 29/Mar/17

Status: Closed
Project: QuickFIX/J
Component/s: Build
Affects Version/s: 1.5.3
Fix Version/s: 1.6.4

Type: Bug Priority: Default
Reporter: Paul Rowe Assignee: Paul Rowe
Resolution: Fixed Votes: 0
Labels: logout, session
Environment:

Linux



 Description   

We experienced a race condition where Application.onLogout() method was not called.

On a Session, we issue a logout from another thread. The counterparty will then respond with a logout message and disconnect. When this happens, there are two threads in play, the "QFJ Message Processor" thread that processes the counterparty logout message, and another "SocketAcceptorIoProcessor" thread spawned by Mina when the socket disconnects. Both of these threads ultimately fall into the Session.disconnect(String,boolean) method. The sequence of events go like this in Session.disconnect():

  1. "SocketAcceptorIoProcessor" sets the responder to null
  2. "QFJ Message Processor" sees that the responder is null and returns from Session.disconnect(). The body of the Session.disconnect() method is wrapped in a try..finally block, so when "QFJ Message Processor" exits this method via the return, it sets logonReceived and logonSent flags to false.
  3. "SocketAcceptorIoProcessor" reaches if statement that checks logonReceived and logonSent, these flags were false because "QFJ Message Processor" cleared them when it exited previously. As a result, Application.onLogout() is never executed.

I took a look at QFJ-803, QFJ-810 and QFJ-790 and they aren't quite the same issue as the one we experienced. I also pulled a latest copy of the QuickFix code from master, and it appears the current code could be potentially affected by this issue.

Can you please confirm if this is a current bug in the code and/or if it has been fixed by another jira that I did not see ?



 Comments   
Comment by Paul Rowe [ 08/Mar/17 ]

Relevant logs

[2017-02-28 22:30:00,118Z][INFO][QFJ Message Processor:0x19] Received logout response
[2017-02-28 22:30:00,119Z][INFO][SocketAcceptorIoProcessor-138.0:0x19] [FIX.4.4:TS->CLIENT] Disconnecting: IO Session closed
[2017-02-28 22:30:00,127Z][INFO][QFJ Message Processor:0x19] Already disconnected: Received logout response

It is a very subtle race condition that happens in the disconnect method. Just after the "SocketAcceptorIoProcessor" thread releases the responderSync lock, a context switch happens to the "QFJ Message Processor" thread. The "QFJ Message Processor" thread will see there is no responder, and returns through the finally block clearing the session state flags. After the return, there is a context switch back to the "SocketAcceptorIoProcessor" thread where it attempts to read the logon state flags, but they have been cleared by the other thread. The trick is getting a context switch to happen just after the responderSync synchronized block.

Comment by Christoph John [ 08/Mar/17 ]

Hi Paul Rowe,

there was a pull request just recently being created that should work around this issue: https://github.com/quickfix-j/quickfixj/pull/102
So it should be fixed with the next release.

Cheers,
Chris.

Comment by Paul Rowe [ 08/Mar/17 ]

Hi Chris, thanks for the response. I reviewed the pull request. The thing that I am currently struggling with now is a unit test. Any unit tests for this laying around ?

Comment by Christoph John [ 08/Mar/17 ]

Hi Paul Rowe,

I don't have a unit test ready. However, after browsing the existing ones it might be reproducible by the same technique as used in SessionResetTest. There is a PausableThreadPoolExecutor to which you can submit two threads. Each of the threads should then call Session.disconnect() concurrently. IMHO this should exhibit the problematic behaviour (maybe not always, though).

Cheers,
Chris.

Comment by Paul Rowe [ 08/Mar/17 ]

OK I will see if I can whip one up and if so I will also submit a pull request. Thanks.

Comment by Christoph John [ 08/Mar/17 ]

Great, thanks.

Comment by Paul Rowe [ 09/Mar/17 ]

I think the pull request generated by another QF user had issues (I commented on them there), so I generated my own pull request and submitted it to the QF team for review.

Generated at Sat Nov 23 07:51:10 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.