[QFJ-220] QFJ hangs in IoSessionResponder.disconnect() while waiting for scheduled messages to be written Created: 06/Aug/07  Updated: 11/Feb/09  Resolved: 14/Aug/07

Status: Closed
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: None
Fix Version/s: 1.3.0

Type: Bug Priority: Default
Reporter: Toli Kuznets Assignee: Steve Bate
Resolution: Fixed Votes: 0
Labels: None
Environment:

HEAD revision of QFJ (actually, rev 710)


Attachments: Text File responder.patch     Text File socket_state.txt     Text File trace.txt    
Issue Links:
Relates
relates to QFJ-215 QFJ deadlocks in Session.disconnect()... Closed

 Description   

This is a continuation of QFJ-215.
I'm now seeing a new hang that happens in IoSessionResponder.disconnect() on the waitForScheduleMessagesToBeWritten() line.

It seems that one of the connecting sockets gets the underlying MINA framework into a "bad state", and the IoSessionResponder.waitForScheduleMessagesToBeWritten() method never exits the loop that waits for the messages to be written.

I have 2 acceptor sessions running with MINA 1.1.0, and both end up in this state after a few days of being up.
See the attached logs for the full stack dump and the output of the netstat command (on Ubuntu Linux).

The relevant portion of the stack trace is:
"SocketAcceptorIoProcessor-0.0" prio=1 tid=0x082427c0 nid=0x546e sleeping[0xb0c71000..0xb0c71f60]
at java.lang.Thread.sleep(Native Method)
at quickfix.mina.IoSessionResponder.waitForScheduleMessagesToBeWritten(IoSessionResponder.java:55)
at quickfix.mina.IoSessionResponder.disconnect(IoSessionResponder.java:43)
at quickfix.Session.disconnect(Session.java:1370)
at quickfix.mina.AbstractIoHandler.exceptionCaught(AbstractIoHandler.java:82)

It seems that the waitForScheduledMessagesToBeWritten() call that was added in rev 698 (http://quickfixj.svn.sourceforge.net/viewvc/quickfixj/trunk/core/src/main/java/quickfix/mina/IoSessionResponder.java?view=diff&r1=697&r2=698) is the crux.
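For context, here is a minimal sketch (not the actual QFJ source) of the kind of wait loop described above, assuming the pending-write counter is MINA's IoSession.getScheduledWriteRequests(); if that counter never drops to zero after a socket error, the loop never exits:

import org.apache.mina.common.IoSession;

class WaitForWritesSketch {
    // Illustrative only: poll until MINA reports no queued write requests.
    // With no upper bound, a socket left in a bad state hangs this loop forever.
    static void waitForScheduledMessagesToBeWritten(IoSession ioSession) throws InterruptedException {
        while (ioSession.getScheduledWriteRequests() > 0) {
            Thread.sleep(10); // poll every 10ms, indefinitely
        }
    }
}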



 Comments   
Comment by Toli Kuznets [ 06/Aug/07 ]

Output of the 'netstat -an | grep 7001' (our acceptor socket) command on Linux.
The main 7001 socket is established, but all the incoming connections are hung.

Comment by Toli Kuznets [ 06/Aug/07 ]

Full Java stack trace at time of hang

Comment by Steve Bate [ 07/Aug/07 ]

Damn. The reason the wait functionality is there is because disconnect would frequently complete before the logout message was sent. We could just wait once for a short time (say 50ms) and then do the close. That should generally be enough time for the logout to be sent. If it's not sent, it's not a critical problem, but it's polite behavior to send the message.
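A rough sketch of that bounded-wait idea (illustrative names, not the committed change):

import org.apache.mina.common.IoSession;

class BoundedWaitSketch {
    // Illustrative only: cap the total wait at roughly 50ms, then close anyway,
    // so a wedged write queue can no longer block disconnect().
    static void waitBrieflyForPendingWrites(IoSession ioSession) {
        for (int i = 0; i < 5 && ioSession.getScheduledWriteRequests() > 0; i++) {
            try {
                Thread.sleep(10); // 5 x 10ms = 50ms upper bound
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        // proceed with ioSession.close() whether or not the logout was flushed
    }
}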

Comment by Toli Kuznets [ 07/Aug/07 ]

Steve, what could we be doing differently in QFJ that other MINA users aren't?
From the stack trace, it looks like the problem originates in SocketIoProcessor.read (line 227; I had been using MINA 1.1.1).
It's sitting in the catch(Throwable) inside the read loop - so could it be that there's an error on the socket, it fires an exception, and then we end up waiting for the socket to send out some messages while the socket-related code has already essentially "panicked"?

As a result, we can't send any outgoing Logout messages and we end up hanging.

Is this possible? Where is the code path that tries to send something out? Or does it mean we just have some "outstanding" messages waiting to go out when the socket code panics, and that it's not necessarily a Logout we are waiting to send?
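For reference, a hypothetical outline (not the actual AbstractIoHandler source) of the error path shown in the stack trace, where a socket error reaches exceptionCaught() and the resulting disconnect() ends up waiting on writes the broken socket can no longer flush:

import org.apache.mina.common.IoHandlerAdapter;
import org.apache.mina.common.IoSession;

class ExceptionPathSketch extends IoHandlerAdapter {
    public void exceptionCaught(IoSession ioSession, Throwable cause) {
        // Hypothetical lookup: QFJ associates the FIX session with the MINA session.
        quickfix.Session qfjSession = lookupQuickfixSession(ioSession);
        if (qfjSession != null) {
            try {
                qfjSession.disconnect(); // enters IoSessionResponder.disconnect() and the wait loop
            } catch (Exception e) {
                // log and give up; the socket is already in a bad state
            }
        }
    }

    private quickfix.Session lookupQuickfixSession(IoSession ioSession) {
        return null; // placeholder for illustration only
    }
}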

Comment by Steve Bate [ 07/Aug/07 ]

I think your analysis is probably correct, but without digging into MINA deeply I'm not sure how to verify it. My guess is that a message send has been initiated, there is a socket error, and the pending write count is never decremented (although strictly speaking, there are no more pending writes).

I'm not sure we are doing anything differently in checking the written message count. However, we can't call close().join() (the related problem) because we use a manual thread model. MINA is typically used with some type of thread pool executor. QF manages the threads in the Connector implementations and I didn't want another thread handoff between MINA and the Connectors. In the long term, I'd like to eliminate the thread management in the Connectors and use a more typical MINA threading architecture. The only sticky issue is that messages must be delivered to Sessions in the order they were received.
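Not QFJ code, but for the ordering constraint mentioned above, one common pattern is to funnel each session's inbound messages through its own single-threaded executor, so a pool can be used without reordering messages within a session; a sketch:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PerSessionDispatcherSketch {
    private final ConcurrentMap<String, ExecutorService> executors = new ConcurrentHashMap<String, ExecutorService>();

    // Messages for the same session ID run on the same single thread,
    // preserving arrival order; different sessions may run concurrently.
    void dispatch(String sessionId, Runnable handleMessage) {
        ExecutorService executor = executors.get(sessionId);
        if (executor == null) {
            ExecutorService created = Executors.newSingleThreadExecutor();
            ExecutorService existing = executors.putIfAbsent(sessionId, created);
            if (existing != null) {
                created.shutdown(); // lost the race; discard the extra executor
                executor = existing;
            } else {
                executor = created;
            }
        }
        executor.execute(handleMessage);
    }
}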

Comment by Toli Kuznets [ 07/Aug/07 ]

What if we overload the Session.disconnect() call to take a boolean specifying whether or not to try and wait for messages to be cleared?

In that case, when we call Session.disconnect() from AbstractIoHandler.exceptionCaught(), we tell it to skip waiting for messages to be written, and we let it do the waiting in all the other cases.
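Roughly what the overload could look like (signatures illustrative, not the attached patch):

import java.io.IOException;

class SessionDisconnectSketch {
    // Illustrative stand-in for the Responder the Session delegates to,
    // extended with the proposed boolean flag.
    interface Responder {
        void disconnect(boolean waitForMessagesToBeWritten);
    }

    private final Responder responder;

    SessionDisconnectSketch(Responder responder) {
        this.responder = responder;
    }

    public void disconnect() throws IOException {
        disconnect(true); // normal path: wait briefly so the Logout can be written
    }

    public void disconnect(boolean waitForMessagesToBeWritten) throws IOException {
        // exceptionCaught() would call disconnect(false) to avoid waiting on a dead socket
        responder.disconnect(waitForMessagesToBeWritten);
        // ... remaining session state cleanup unchanged ...
    }
}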

Then, if our theory is correct, that should fix the problem. What kind of message are we expecting to be sent out? A Logout?

I can try this modification on the Marketcetera Exchange simulator and ask the users to pound the exchange and see what happens.
Is there any way to exercise this? I think it may be triggered by having many simultaneous incoming connections on different SessionIDs.

Comment by Toli Kuznets [ 11/Aug/07 ]

Patch for propagating a boolean that specifies whether or not to wait for all messages to be written out.
This is not the most elegant solution, as it adds another call to the Responder API that duplicates the existing one.

If we choose to go with this solution, I'd vote to modify the Responder API to remove the overloaded call and only have one method that takes the boolean.

I'll update the bug in a few days, after this code has run long enough behind the outward-facing Marketcetera Exchange Simulator (http://exchange.marketcetera.com/marketDisplay) to show there are no other lingering issues with dynamic acceptor sessions.

Comment by Steve Bate [ 14/Aug/07 ]

We'll only wait for a limited amount of time (up to 50ms) for messages to be written. This should eliminate the hang.
