[QFJ-220] QFJ hangs in IoSessionResponder.disconnect() while waiting for scheduled messages to be written Created: 06/Aug/07 Updated: 11/Feb/09 Resolved: 14/Aug/07 |
|
Status: | Closed |
Project: | QuickFIX/J |
Component/s: | Engine |
Affects Version/s: | None |
Fix Version/s: | 1.3.0 |
Type: | Bug | Priority: | Default |
Reporter: | Toli Kuznets | Assignee: | Steve Bate |
Resolution: | Fixed | Votes: | 0 |
Labels: | None | ||
Environment: |
HEAD revision of QFJ (actually, rev 710) |
Attachments: | responder.patch socket_state.txt trace.txt | ||||||||
Issue Links: |
|
Description |
This is a continuation of Seems that one of the connecting sockets gets the underlying MINA framework in a "bad state" and the IoSessionResonder.waitForScheduleMessagesToBeWritten() function never gets out of the loop that's waiting for the messages to be written. I have 2 acceptor sessions running with MINA 1.1.0, and both end up in this state after a few days of being up. The relevant portion of the stack trace is: Seems that the waitForScheduledMessagesToBeWritten() call that was added in rev 698 (http://quickfixj.svn.sourceforge.net/viewvc/quickfixj/trunk/core/src/main/java/quickfix/mina/IoSessionResponder.java?view=diff&r1=697&r2=698) is the crux. |
Comments |
Comment by Toli Kuznets [ 06/Aug/07 ] |
Output of 'netstat an|grep 7001" (our acceptor socket) command on Linux. |
Comment by Toli Kuznets [ 06/Aug/07 ] |
Full Java stack trace at time of hang |
Comment by Steve Bate [ 07/Aug/07 ] |
Damn. The reason the wait functionality is there is because disconnect would frequently complete before the logout message was sent. We could just wait once for a short time (say 50ms) and then do the close. That should generally be enough time for the logout to be sent. If it's not send, it's not a critical problem, but it's polite behavior to send the message. |
Comment by Toli Kuznets [ 07/Aug/07 ] |
Steve, what can we be doing differently in QFJ than others aren't with MINA? as a result, we can't send any outgoing Logout messages out and end up hanging. Is this possible? where's the code path that tries to send something out? Or does that mean we just have some "outstanding" messages waiting to go out when the socket code panics, and that it's not necessarily a logout we are waiting to send out? |
Comment by Steve Bate [ 07/Aug/07 ] |
I think your analysis probably correct, but without digging into MINA deeply I'm not sure how to verify it. My guess is that a message send has been initiated, there is a socket error, and the pending write count is never decremented (although strictly speaking, there are no more pending writes). I'm not sure we are doing anything differently in checking the written message count. However, we can't call close().join() (the related problem) because we use a manual thread model. MINA is typically used with some type of thread pool executor. QF manages the threads in the Connector implementations and I didn't want another thread handoff between MINA and the Connectors. In the long term, I'd like to eliminate the thread management in the Connectors and use a more typical MINA threading architecture. The only sticky issue is that messages must be delivered to Sessions in the order they were received. |
Comment by Toli Kuznets [ 07/Aug/07 ] |
What if we overload the Session.disconnect() call to take a boolean specifying whether or not to try and wait for messages to be cleared? in that case, when we call Session.disconnect() from AbstractIoHandler.exceptionCaught() we tell it to skip waiting for messages to be written, and let it do the waiting in all the other cases? Then, if our theory is correct, that should probably fix the problem. What kind of message are we expecting to be sent out? A log out? I can try this modification on the Marketcetera Exchange simulator and ask the users to pound the exchange and see what happens. |
Comment by Toli Kuznets [ 11/Aug/07 ] |
Patch for the propagating a boolean on whether or not to wait for all messages to be written out. If we choose to go with this solution, i'd vote to modify the Responder API to remove the overloaded call, and only have 1 with the boolean. I'll update the bug in a few days after this code runs long enough behind the outward-facing Marketcetera Exchange Simulator (http://exchange.marketcetera.com/marketDisplay) to show there are no other lingering issues with dynamic acceptor sessions |
Comment by Steve Bate [ 14/Aug/07 ] |
We'll only wait for a limited about of time (up to 50ms) for messages to be written. This should eliminate the hang. |