[QFJ-895] Reconnecting initiator does not work under some circumstances Created: 23/Jun/16  Updated: 13/Dec/16  Resolved: 11/Jul/16

Status: Closed
Project: QuickFIX/J
Component/s: None
Affects Version/s: 1.6.2
Fix Version/s: 1.6.3

Type: Bug Priority: Major
Reporter: Christoph John Assignee: Guido Medina
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by QFJ-868 IoSessionInitiator can't reconnect th... Closed
is duplicated by QFJ-898 auto reconnect fails? Closed
Relates
relates to QFJ-866 Initiator session timer with SSL enabled Closed

 Description   

Youyu Shao provided some fixes on the mailing list.

Original mail:

On 23/06/16 02:35, Youyu Shao wrote:
> QuickFIX/J Documentation: http://www.quickfixj.org/documentation/
> QuickFIX/J Support: http://www.quickfixj.org/support/
>
>
> Recently we have been seriously looking into using QuickFixJ (code base
> 1.6.2).
> In so doing, we have discovered several problems and subsequently resolved
> in house.
>
> 1. Mina would throw org.apache.mina.core.RuntimeIoException wrapping
> java.net.SocketException at various places.
> For example, this happens at:
> org.apache.mina.core.RuntimeIoException: java.net.SocketException:
> Invalid argument: no further information
> at
> org.apache.mina.transport.socket.nio.NioSocketSession$SessionConfigImpl.setTcpNoDelay(NioSocketSession.java:208)
> at quickfix.mina.NetworkingOptions.apply(NetworkingOptions.java:155)
> at
> quickfix.mina.AbstractIoHandler.sessionCreated(AbstractIoHandler.java:93)
> at
> quickfix.mina.initiator.InitiatorIoHandler.sessionCreated(InitiatorIoHandler.java:50)
> .....
> When this happens, exceptionCaught method of AbstractIoHandler is not
> able to properly detect there was an underlying socket IOException.
> AbstractIoHandler has been modified to un-wrap the exception and see if
> along the exception chain there is an IOException.
>
> 2. org.apache.mina.core.session.IoSession internal state is problematic. On
> and off,
> IoSession.isConnected() does not return the correct value after QuickFixJ
> called IoSession.close() method(up on experiencing IOException).
> quickfix.mina.initiator.IoSessionInitiator has been refactored to replace
> the old IoConnector with a new one upon experiencing IOException.
> quickfix.mina.AbstractIoHandler is adjusted accordingly to signal the
> IoSessionInitiator to do so.
>
> 3. Modified the run() method of ConnectTask in IoSessionInitiator to be
> protected by try block. This is to prevent the scheduled Task from
> stop running in presence of exception. (see javadoc on
> scheduleWithFixedDelay(...) method of
> java.util.concurrent.ScheduledExecutorService
> which states that "...If any execution of the task encounters an
> exception, subsequent executions are suppressed...")
>
> 4. Again, org.apache.mina.core.session.IoSession internal state is
> problematic. Even after the
> normal disconnect() from IoSessionResponder, on and off the
> IoSession.isConnected() still does not return the correct value.
> quickfix.mina.IoSessionResponder is adjusted to signal the
> IoSessionInitiator to replace the IoConnector. IoSessionResponderTest is
> adjusted accordingly.
>
>
> 5. quickfix.ThreadedSocketAcceptor leaks resources when stoped.
> Instead of calling stopSessionTimer() in the stop() method, it should
> call stopInitiators().
>
> Problems identified in 1, 2, 3, and 4 ultimately manifested them in ways
> that reconnecting would stop working (forever), even absent of IOExceptions.
>
> With the fix, we have the QuickFixJ engine running for a week where
> reconnecting works with no problem. During this period the counterparty
> logged out and severed the TCP connection more than 20000 times. Our test
> also indicates that the fix works equally SSL vs non-SSL.
>
> Modified files are attached.
>
> AbstractIoHandler.java
> <http://quickfix-j.364392.n2.nabble.com/file/n7579557/AbstractIoHandler.java>
> IoSessionInitiator.java
> <http://quickfix-j.364392.n2.nabble.com/file/n7579557/IoSessionInitiator.java>
> IoSessionResponder.java
> <http://quickfix-j.364392.n2.nabble.com/file/n7579557/IoSessionResponder.java>
> IoSessionResponderTest.java
> <http://quickfix-j.364392.n2.nabble.com/file/n7579557/IoSessionResponderTest.java>
> ThreadedSocketInitiator.java
> <http://quickfix-j.364392.n2.nabble.com/file/n7579557/ThreadedSocketInitiator.java>
>



 Comments   
Comment by Guido Medina [ 23/Jun/16 ]

I'm having this issue also with QFJ 1.7.x (current master), it seems that the initiator is never able to recover from this, I was still trying to figure out what this is happening:
On a Linux system once you close the connection, netstat shows that a socket is still owned by the Java socket, to the point that I had to restart the OS in order to make the initiator connect again.

Comment by Guido Medina [ 23/Jun/16 ]

The state issue with Apache Mina might has been fixed with latest, which isn't maybe what QFJ 1.6.3 is using?

Comment by Christoph John [ 23/Jun/16 ]

Since QF/J 1.6.2 the MINA version 2.0.13 is used. You submitted a pull request for that. Do you have a link to the MINA issue which should fix this? On QFJ-868 you refer to https://issues.apache.org/jira/browse/DIRMINA-995 but this is probably another problem.

Comment by Guido Medina [ 23/Jun/16 ]

You are right, it definitely looks like another problem, will you take care of that PR?, if you can't make it before 5:30PM London time I can download the patch files and send a PR.

Comment by Christoph John [ 23/Jun/16 ]

Hi Guido, that would be great, thank you. I could not look at this earlier than next week or at the weekend.

Comment by Guido Medina [ 23/Jun/16 ]

Glad to do it, I'll send you a PR late tonight.

Comment by Christoph John [ 23/Jun/16 ]

Thanks, really appreciated.

Comment by Guido Medina [ 23/Jun/16 ]

I was able to surgically modify the other 4 which I will commit on master, but IoSessionInitiator.java I couldn't after last SSL refactored.

Comment by Guido Medina [ 23/Jun/16 ]

Check commit: https://github.com/quickfix-j/quickfixj/pull/76/commits/d82566e3d5165fb3873dc253328a3a61d3054c66

Comment by Guido Medina [ 23/Jun/16 ]

Concerning the class I couldn't patch, scratch that, I figured it out.

Comment by Guido Medina [ 23/Jun/16 ]

Simplified commit for easier code review: https://github.com/quickfix-j/quickfixj/pull/76/commits/a72d3ef5f7a9a99fd129d4430585d6591d1e2596

Comment by Guido Medina [ 23/Jun/16 ]

PRs submitted for the following branches:

Comment by Guido Medina [ 24/Jun/16 ]

The patch broke a test by moving

getNextSocketAddress()

to inside a try...catch, for this case this exception needs to be thrown: https://github.com/quickfix-j/quickfixj/pull/78

Generated at Sat Nov 23 07:24:36 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.