[QFJ-959] Fix session could stuck forever in acceptor mode. Created: 07/Nov/18  Updated: 16/Nov/18

Status: Open
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 2.1.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Dmitry Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

Repetitive error: Multiple logons/connections for this session are not allowed
when client try to reconnect to server. (see server logs below)

Looks like session was not cleared correctly after exception.
Exception

Nov  6 06:47:32 aemulator: java.io.IOException: Broken pipe
Nov  6 06:47:32 aemulator: at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
Nov  6 06:47:32 aemulator: at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
Nov  6 06:47:32 aemulator: at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
Nov  6 06:47:32 aemulator: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
Nov  6 06:47:32 aemulator: at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
Nov  6 06:47:32 aemulator: at org.apache.mina.transport.socket.nio.NioProcessor.write(NioProcessor.java:384)
Nov  6 06:47:32 aemulator: at org.apache.mina.transport.socket.nio.NioProcessor.write(NioProcessor.java:47)
Nov  6 06:47:32 aemulator: at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.writeBuffer(AbstractPollingIoProcessor.java:1107)
Nov  6 06:47:32 aemulator: at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.flushNow(AbstractPollingIoProcessor.java:994)
Nov  6 06:47:32 aemulator: at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.flush(AbstractPollingIoProcessor.java:921)
Nov  6 06:47:32 aemulator: at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:688)
Nov  6 06:47:32 aemulator: at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
Nov  6 06:47:32 aemulator: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Nov  6 06:47:32 aemulator: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Nov  6 06:47:32 aemulator: at java.lang.Thread.run(Thread.java:748)

Server logs.

2018-11-07 13:06:56.731 +0000 ERROR [NioProcessor-471] quickfixj.errorEventFIX.4.4:emulator->TRADE:  - Multiple logons/connections for this session are not allowed
2018-11-07 13:06:57.736 +0000 ERROR [NioProcessor-472] quickfixj.errorEventFIX.4.4:emulator->TRADE:  - Multiple logons/connections for this session are not allowed
2018-11-07 13:06:58.740 +0000 ERROR [NioProcessor-473] quickfixj.errorEventFIX.4.4:emulator->TRADE:  - Multiple logons/connections for this session are not allowed
2018-11-07 13:06:59.780 +0000 ERROR [NioProcessor-474] quickfixj.errorEventFIX.4.4:emulator->TRADE:  - Multiple logons/connections for this session are not allowed
2018-11-07 13:07:01.446 +0000 ERROR [NioProcessor-475] quickfixj.errorEventFIX.4.4:emulator->TRADE:  - Multiple logons/connections for this session are not allowed
2018-11-07 13:07:03.446 +0000 ERROR [NioProcessor-476] quickfixj.errorEventFIX.4.4:emulator->TRADE:  - Multiple logons/connections for this session are not allowed
2018-11-07 13:07:05.446 +0000 ERROR [NioProcessor-477] quickfixj.errorEventFIX.4.4:emulator->TRADE:  - Multiple logons/connections for this session are not allowed
2018-11-07 13:07:07.446 +0000 ERROR [NioProcessor-478] quickfixj.errorEventFIX.4.4:emulator->TRADE:  - Multiple logons/connections for this session are not allowed


 Comments   
Comment by Christoph John [ 07/Nov/18 ]

Hi,

is the pasted stack trace complete? I can not gather much from it. There should also be log entries from QFJ along with the IOException. Can you post them?

I have never seen this behaviour before. Is it reproducible? Did it stop at some time or did it keep running like that?
When an exception occurs, the exceptionCaught() callback from the AbstractIoHandler should get notified and close the session.
Where are you running the QFJ session? Is it a plain Java process or are you running it inside an application server or Spring application?

Thanks,
Chris.

Comment by Christoph John [ 07/Nov/18 ]

And of course as always: would be better to first ask on the mailing list. We can clarify there if it really is a bug.

Comment by Dmitry [ 08/Nov/18 ]

Hi,

is the pasted stack trace complete? I can not gather much from it. There should also be log entries from QFJ along with the IOException. Can you post them?

We do not see another exceptions in log.

I have never seen this behaviour before. Is it reproducible? Did it stop at some time or did it keep running like that?

When an exception occurs, the exceptionCaught() callback from the AbstractIoHandler should get notified and close the session.

Where are you running the QFJ session? Is it a plain Java process or are you running it inside an application server or Spring application?

  • Reproducible from time to time.
  • This is Java plain application which use Spring Core inside.

I believe we found the way how to achieve this behavior. It is definitely not an expected working behavior, but as a result, quickfix could be stuck in some wrong state forever.

From time to time we have an issue with our HyperV cloud. Virtual machine is frozen for ~30-40 seconds. (it is testing environment)
During this time quickfix client tries to reconnect and send logon message.
When VM becomes alive, application throws an exception for TCP session and then try to handle new connection from client and process several logon messages for the same quickfix session.

Probably qfSession.hasResponder() keeps wrong state forever due to exception in Session.disconnect

  public void disconnect(String reason, boolean logError) throws IOException {
        try {
            final boolean logonReceived = state.isLogonReceived();
            final boolean logonSent = state.isLogonSent();

            synchronized (responderLock) {
                if (!hasResponder()) {
                    if (!ENCOUNTERED_END_OF_STREAM.equals(reason)) {
                        getLog().onEvent("Already disconnected: " + reason);
                    }
                    return;
                }
                final String msg = "Disconnecting: " + reason;
                if (logError) {
                    getLog().onErrorEvent(msg);
                } else {
                    getLog().onEvent(msg);
                }
                responder.disconnect();   // Maybe exception is here 
                setResponder(null);
            }


Comment by Christoph John [ 09/Nov/18 ]

Hi,

OK, if the whole VM froze for 30 seconds all sort of things can happen.

You are right: given the log message "Multiple logons/connections for this session are not allowed" it looks like the responder was not set to null.
But good that you can reproduce it. So you could checkout the code from github, tentatively wrap parts of the disconnect() method in a try block with cleanup of the responder in a finally-block and try to reproduce again.
But I would actually expect to see QFJ classes appear in the exception stack trace if something in that method failed. That is also why I was asking if this is a plain Java process. I don't know much about Spring but assume that some parts might run differently than in a "normal" Java process. E.g. sometimes Exceptions are swallowed if thrown in threads that are running as part of a thread pool.

When you say that the Exception might occur in or after responder.disconnect() is called then you should have something in your log file like "Disconnecting: " because that happens some lines above. Do you see something like that in your log file?
Could you maybe post the contents of your event log (including exceptions) around the freeze?

Thanks,
Chris.

Comment by Dmitry [ 16/Nov/18 ]

Hi,

I am sorry for late response.
We transferred our test environment from VM to hardware server.
So, logs were lost and currently we do not have similar issues.

Dmitry.

Comment by Christoph John [ 16/Nov/18 ]

OK, feel free to update the issue when you encounter problems again.

Generated at Sat Nov 23 11:00:11 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.