[QFJ-824] QuickFIXJ stops sending and processing heartbeats Created: 02/Feb/15 Updated: 12/Nov/18 Resolved: 12/Nov/18 |
|
Status: | Closed |
Project: | QuickFIX/J |
Component/s: | Engine |
Affects Version/s: | 1.5.3 |
Fix Version/s: | None |
Type: | Bug | Priority: | Major |
Reporter: | Wongsakorn Chantrapornsyl | Assignee: | Unassigned |
Resolution: | Incomplete | Votes: | 1 |
Labels: | QuickfixJ, timeout | ||
Environment: |
Microsoft Windows Server 2008 |
Description |
I have a problem that QuickFIXJ stops sending the heartbeat. Please see the QuickFIXJ log below. The heartbeat interval is set to 4. 20150127-01:01:34: 8=FIXT.1.19=7535=034=318541128=949=TR MATCHING56=TMSQ03226252=20150127-01:01:34.78910=197 Do you have any idea why QuickFIXJ stops sending the heartbeat and also does not process the message from other side. |
Comments |
Comment by Christoph John [ 02/Feb/15 ] |
Looks like a deadlock to me (at least from what can be seen from the log output). A stack trace at that time would be very useful. |
Comment by Christoph John [ 02/Feb/15 ] |
Do you happen to have a stack trace? Otherwise I cannot tell where the problem is. |
Comment by Anupam Sinha [ 02/Feb/15 ] |
I am unable to edit my comments. Can you please grant me the required permissions. Otherwise you can delete my comment and I will post more info for the same. |
Comment by Christoph John [ 02/Feb/15 ] |
Deleted your previous comment. |
Comment by Anupam Sinha [ 02/Feb/15 ] |
Thanks Christoph! I have added the logs from our connector as well. There is no stack trace that was generated hence unable to attach the same. 8=FIX.4.4^A9=60^A35=0^A49=<SENDER>^A56=<INITIATOR>^A34=9323^A52=20150202-05:18:58.924^A10=134^A 8=FIX.4.4^A9=60^A35=0^A34=8999^A49=<INITIATOR>^A52=20150202-05:18:59.396^A56=<SENDER>^A10=156^A 8=FIX.4.4^A9=60^A35=0^A49=<SENDER>^A56=<INITIATOR>^A34=9324^A52=20150202-05:19:03.925^A10=127^A 8=FIX.4.4^A9=82^A35=1^A49=<SENDER>^A56=<INITIATOR>^A34=9325^A52=20150202-05:19:05.405^A112=20150202-05:19:05^A10=179^A 8=FIX.4.4^A9=60^A35=0^A49=<SENDER>^A56=<INITIATOR>^A34=9326^A52=20150202-05:19:10.405^A10=120^A 8=FIX.4.4^A9=106^A35=A^A34=9000^A49=<INITIATOR>^A52=20150202-05:19:16.203^A56=<SENDER>^A98=0^A108=5^A553=<INITIATOR>@====.com^A554=<PWD>^A10=129^A 8=FIX.4.4^A9=71^A35=A^A49=<SENDER>^A56=<INITIATOR>^A34=9327^A52=20150202-05:19:16.454^A98=0^A108=5^A10=129^A We are also on 1.5.3 and facing the same issue. [02 Feb 2015 05:18:58,926][QFJ Message Processor][INFO ][resources.FIXMessageConnectionFactoryAgent]-session FIX.4.4:<INITIATOR>-><SENDER>, fromAdmin : 8=FIX.4.4,9=60,35=0,34=9323,49=<SENDER>,52=20150202-05:18:58.924,56=<INITIATOR>,10=134, [02 Feb 2015 05:18:59,396][QFJ Timer][INFO ][resources.FIXMessageConnectionFactoryAgent]-toAdmin : 8=FIX.4.4,9=60,35=0,34=8999,49=<INITIATOR>,52=20150202-05:18:59.396,56=<SENDER>,10=156, [02 Feb 2015 05:19:11,512][SocketConnectorIoProcessor-0.0][INFO ][quickfix.Session]-[FIX.4.4:<INITIATOR>-><SENDER>] Disconnecting: IO Session closed [02 Feb 2015 05:19:11,513][SocketConnectorIoProcessor-0.0][INFO ][resources.FIXMessageConnectionFactoryAgent]-Logout of FIX session with exchangeFIX.4.4:<INITIATOR>-><SENDER> [02 Feb 2015 05:19:11,513][SocketConnectorIoProcessor-0.0][INFO ][util.ConnectionStatusHistory]-02/02/2015 05:19:11.513 GMT FIXMessageConnectionFactoryAgent Disconnected [02 Feb 2015 05:19:16,203][SocketConnectorIoProcessor-0.0][INFO ][initiator.InitiatorIoHandler]-MINA session created for FIX.4.4:<INITIATOR>-><SENDER>: local=/11.161.40.160:22770, class org.apache.mina.transport.socket.nio.SocketSessionImpl, remote=<SENDER>-hc-nat/IP.add.res.ss:port [02 Feb 2015 05:19:16,204][QFJ Timer][INFO ][resources.FIXMessageConnectionFactoryAgent]-toAdmin : 8=FIX.4.4,9=71,35=A,34=9000,49=<INITIATOR>,52=20150202-05:19:16.203,56=<SENDER>,98=0,108=5,10=109, [02 Feb 2015 05:19:16,529][QFJ Message Processor][INFO ][resources.FIXMessageConnectionFactoryAgent]-Logout of FIX session with exchangeFIX.4.4:<INITIATOR>-><SENDER> |
Comment by Christoph John [ 02/Feb/15 ] |
The stack trace is not automatically generated. If the issue appears, then you need to do a "jstack <PID>" on the java process. On Windows you could also press ctrl+break on the application's window which should send the stack trace into the log file. What happens in "resources.FIXMessageConnectionFactoryAgent-toAdmin"?? I think this is no general issue so without a stack trace we won't get far. Unless you are able to reproduce the issue in a unit test. |
Comment by Anupam Sinha [ 02/Feb/15 ] |
Any suggestions on how to issue the jstack command when the issue occurs? It will be difficult to predict when the issue occurs and issue a command then. |
Comment by Christoph John [ 02/Feb/15 ] |
You could trigger that e.g. when a TestRequest comes in (which most of the time means that the counterparty missed heartbeats from your side). What about my other points? |
Comment by Wongsakorn Chantrapornsyl [ 03/Feb/15 ] |
Unfortunately, I found this issue on the client's machine (production). I cannot run the command after the issue appears. In this case, after disconnection, it can reconnect to the FIX server again and then work fine. However, they do not want any disconnection. Here is the configuration. -------------------------------------------------- ReLogonInterval=30000 |
Comment by garima gangwar [ 03/Feb/15 ] |
On Behalf of Anupam Sinha, Configurations are: connectionType=initiator |
Comment by Anupam Sinha [ 03/Feb/15 ] |
What happens in "resources.FIXMessageConnectionFactoryAgent-toAdmin"? Simply logs the request when the request is NOT a logon request. |
Comment by Christoph John [ 03/Feb/15 ] |
Wongsakorn Chantrapornsyl : OK, so the client has a software which uses QuickFIX/J? garima gangwar, Anupam Sinha: what is bufferSize config? Is it SocketSend/ReceiveBufferSize?? @all: what was the last application message that was processed prior to this problem? Is it always the same app message or is the time span after the last application message the same before the session breaks? |
Comment by Wongsakorn Chantrapornsyl [ 03/Feb/15 ] |
Hi Christoph John ReLogonInterval and RetryUserSessionLogonInterval are the internal configuration. They are not the QuickFIXJ configuration. what was the last application message that was processed prior to this problem? Is it always the same app message or is the time span after the last application message the same before the session breaks? Does it always break at the same time? Or after same intervals? Is there only one FIX session configured? I assume there are not much messages transmitted? So this should be no performance problem, shouldn't it? I currently cannot say much without having a stack trace. But I assume it is no general error. If it must be run when the issue occurs, it is quite hard to notice the problem occurs. Thank you. |
Comment by Christoph John [ 03/Feb/15 ] |
Hi, a heartbeat is no application message. An ExecutionReport/NewOrderSingle/... are application messages. However, you said that only heartbeats were received. Regarding your initial log output in the issue description: you say that the engine does receive Heartbeats/TestRequests but does not answer them. So I would change the fromAdmin method of your application to check if the received message is a TestRequest. If yes, then create a stack trace and/or check for deadlocked threads. This can be achieved with some MBeans which are included in Java6 and up (ManagementFactory.getThreadMXBean()) Moreover, when I look at the log there seems to be a problem with the times. E.g. the first messages arrive in time (comparing the log time stamp and field 52/SendingTime). However, afterwards there is a gap of about three seconds. So this either means your network is flaky or the application/computer is heavily loaded and does not get the messages in time. |
Comment by Wongsakorn Chantrapornsyl [ 03/Feb/15 ] |
From the sniffer message, I can see that the ACK of the heartbeat is sent immediately after the heartbeat arrived. Then I think the QuickFIXJ process the heartbeat delayed. The last activity from QuickFIXJ was sending the Test Request. You can see that QuickFIXJ has already received heatbeat but QuickFIXJ sent test request. Then I think QuickFIXJ does not process the heartbeat. I do not understand why QuickFIXJ can send the test request but does not send the heartbeat. |
Comment by Christoph John [ 03/Feb/15 ] |
This is triggered by two different threads. There is a separate thread which sends out heartbeats. The processing of incoming messages is in a separate thread (message processor thread). Incoming messages are also handled on the message processor thread. This will also call your application callbacks fromAdmin/fromApp. So I would start searching there. 20150127-01:01:50: 8=FIXT.1.19=7535=034=318571128=949=TR MATCHING56=TMSQ03226252=20150127-01:01:47.03910=192 20150127-01:01:51: 8=FIXT.1.19=9035=134=318581128=949=TR MATCHING56=TMSQ03226252=20150127-01:01:48.063112=HBTO-3185810=242 The 52/SendingTime is off by three seconds. So actually you "simply" have to find out what happens there. Maybe it is not even within your application but the machine starts to get slow. 20150127-01:01:52: 8=FIXT.1.19=7535=034=318591128=949=TR MATCHING56=TMSQ03226252=20150127-01:01:52.12910=190 20150127-01:01:51: 8=FIXT.1.19=10435=134=3237949=TMSQ03226252=20150127-01:01:51.89656=TR MATCHING57=FXM142=TRFXMJP53776xxx112=TEST10=200 So the log file goes "backwards"?! In my opinion you are doing something nasty that you probably shouldn't do. I think the problem might be somewhere in your application. You are either doing something concurrently that shouldn't be done or your application is doing some time-consuming logic at some point. Are you concurrently modifying quickfix.Message objects? What are you doing anyway in your application code? |
Comment by Anupam Sinha [ 04/Feb/15 ] |
garima gangwar, Anupam Sinha: what is bufferSize config? Is it SocketSend/ReceiveBufferSize?? This value is now not being used anymore. What is SocketSend/ReceiveBufferSize? @all: what was the last application message that was processed prior to this problem? Is it always the same app message or is the time span after the last application message the same before the session breaks? Does it always break at the same time? Or after same intervals? Is there only one FIX session configured? I assume there are not much messages transmitted? So this should be no performance problem, shouldn't it? |
Comment by Christoph John [ 12/Nov/18 ] |
Closing due to inactivity and old version used. |