[QFJ-252] Quickfix/J stops sending tag 141 on logon Created: 27/Sep/07  Updated: 09/Jan/08  Resolved: 22/Dec/07

Status: Closed
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.2.1
Fix Version/s: None

Type: Bug Priority: Default
Reporter: André Malenfant Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None


 Description   

After several logon attemps, quickfix/J stops sending tag 141 and starts increasing the sequence number which has for effect to received a msg sequence error when the logon is eventually responded. See the log excerpt below and note that there are numerous logon attemps before and after it:

8=FIX.4.39=8035=A34=149=cgiwl352=20070905-10:12:44.51056=bofaCgiwlTest98=0108=30141=Y10=154
8=FIX.4.39=8035=A34=149=cgiwl352=20070905-10:12:55.50956=bofaCgiwlTest98=0108=30141=Y10=164
8=FIX.4.39=8035=A34=149=cgiwl352=20070905-10:13:06.54056=bofaCgiwlTest98=0108=30141=Y10=156
8=FIX.4.39=7435=A34=249=cgiwl352=20070905-10:13:18.52356=bofaCgiwlTest98=0108=3010=119
8=FIX.4.39=7435=A34=349=cgiwl352=20070905-10:13:29.52356=bofaCgiwlTest98=0108=3010=122
8=FIX.4.39=7435=A34=449=cgiwl352=20070905-10:13:40.52256=bofaCgiwlTest98=0108=3010=115
8=FIX.4.39=7435=A34=549=cgiwl352=20070905-10:13:51.52156=bofaCgiwlTest98=0108=3010=117



 Comments   
Comment by Toli Kuznets [ 27/Sep/07 ]

Andre,
can you attach your quickfixj.cfg file to this?
The values of ResetOnLogout/ResetOnDisconnect/ResetOnLogin all contribute to whether or not tag 141 gets sent.

Comment by André Malenfant [ 27/Sep/07 ]

ResetOnLogout not specified (N)
ResetOnDisconnect not specified (N)
ResetOnLogin not specified (N)
SendResetSeqNumFlag=Y

Comment by André Malenfant [ 27/Sep/07 ]

Sorry,

ResetOnDisconnect=Y

Comment by Steve Bate [ 06/Oct/07 ]

This won't be enough information for me to effectively diagnose the issue. If you start the Banzai example program with a similar configuration and with no corresponding acceptor program, do you see the same behavior?

Comment by André Malenfant [ 08/Jan/08 ]

We can reproduce this problem almost systematically on a production environment and this is a big problem for us as we cannot rely on automatic reconnection if connections are dropped. Most of the time, the flag is not sent for a period of time and then, all of the suden, it is sent. Sometimes it can take serveral minutes and close to an hour for the flag to be sent.

I understand that you don't have enough info to troubleshoot but can you give us ideas on what data to collect (I was about to say debugging level to help investigate?

Comment by Steve Bate [ 08/Jan/08 ]

André,

Well, I did make a suggestion back in October.

This doesn't appear to be a reconnect problem. You can't send a logon message if you aren't connected. Apparently, the reconnect is working fine.

For the sequence reset, QFJ will only sent send 141=Y when the expected sequence numbers are 1 for both sender and target. Notice that in your original log example the 141=Y stopped being sent when the sequence number was greater than 1.

The ResetOnDisconnect flag resets the sequence numbers every time there is a disconnect and this should cause the 141=Y to be sent. Are the logon attempts timing out? That should cause a disconnect, but I'm curious. In any case, it's seems /very/ strange to be doing a sequence reset on disconnect in a production environment. Are you doing anything in your code related to logon messages (for example, explicitly initiated them or modifying them in some way)?

Comment by André Malenfant [ 08/Jan/08 ]

Steve, I did not try your suggestion in October because I cannot reproduce the problem in my development environment with my app. I can only reproduce in prod. The only difference beeing that the remote application is different and I have no control over it.

Sorry for the confusion I meant when a logout occurs (whether scheduled or because of a connection drop) quickfixj will eventually start to issue logon messages (after the connection is reestablished, of course) and will not send the flag in all cases. I cannot find any pattern that could help understand the conditions for it.

In my original log, quickfixj sends the flag 3 times and then, all of the suden, starts incrementing the sequence number and stops sending the flag. Why is this so? Why would it start incrementing the sequence number without having received a logon response from the other party? Nothing happens in between except quickfixj retrying a logon every mintue.

The other case we see is the opposite. Quickfixj does not send the sequence reset flag even tough the sequence number has been reset on disconnect. Which obviously creates an issue and the other party denies the login because of the numbers beeing desynchronized. Eventually, after some time, quickfixj resets the number and starts to send the flag. Sometimes, to speed up the process I restart my app until this occurs without having done anything else than restarting.

The reason that we want to reset is because the data that is transmitted is transent and thus cannot be resent. There is no point in recovering lost messages and we simply want to reset the sequence numbers when we disconnect.

We are not doing anything special with regard to the login. We let quickfixj handle that. The only message we create and send explicitely are application level messages.

This is a real life production issue, and it is problematic for us.

Comment by Steve Bate [ 08/Jan/08 ]

> In my original log, quickfixj sends the flag 3 times and then, all of the suden, starts incrementing the
> sequence number and stops sending the flag. Why is this so?

Like I said, QFJ will only send the sequence resets when the sender/target sequence numbers are 1. The fourth and subsequent messages in your attached log has a sequence number greater than 1.

Which message store are you using? Are you using the PersistMessages=N session option? This might eliminate the need to hack the sequence numbers when no messages should ever be resent (like with market data).

Can you create a timeline that includes both the logon message events and connect/disconnect events? This might give some clues.

Comment by André Malenfant [ 08/Jan/08 ]

Agreed but why did quickfixj started incrementing the sequence between too logon attempt? The sender was never involved in the log above.

I will document more clearly an occurance of the problem and get back to you.

Here is my config (I scrambled some sensitive data...):

[DEFAULT]
ConnectionType=initiator
ReconnectInterval=60

[SESSION]
BeginString=FIX.4.3
SenderCompID=SomeCompID
TargetCompID=TargetCompID
StartTime=00:00:00
EndTime=00:00:00
HeartBtInt=30
SocketConnectPort=9999
SocketConnectHost=MyHost
UseDataDictionary=N
SessionQualifier=MySession
SendResetSeqNumFlag=N
ResetOnDisconnect=Y
DataDictionary=...\config\schemas\quickfix\FIX43.xml
JdbcDriver=net.sourceforge.jtds.jdbc.Driver
JdbcURL=jdbc:jtds:sqlserver:server:port
JdbcUser=andre
JdbcPassword=password
JdbcStoreMessagesTableName=quickfixMessages
JdbcStoreSessionsTableName=quickfixSessions

Thanks

Comment by André Malenfant [ 09/Jan/08 ]

Steve, the config I sent previously is not valid, this was a test config I used to try different settings.

You should read:

SendResetSeqNumFlag=Y
ResetOnDisconnect=N

We had the issue on 17 apps yesterday night. Here is an example: (timestamps in Europe\London)

1- It starts with a scheduled logout coming from the other party which I changed here to DEF
2- We respond to the logout successfully
3- Quickfix then starts to try to reconnect and keeps the sequence number even thoug we have have SendResetSeqNumFlag=Y . Not a big deal since the other party in not responding yet.
4- At the 7th retry, quickfix reset the sequence to 1 but does NOT send the flag
5- The other party returns an error, a valid one since we did not ask him to reset the sequence.
6- This goes on over an hour until the flag is sent after numerous restarts.
7- Notice the last lines where, finally, quickfix sends the flag.

We know have set ResetOnDisconnect=Y since it reduces the number of restarts we need to do to correct the situation to one.
We recognise that there is something strange in the fact that the other party returns an error in the first place. It is expecting a sequence number that is higher than everything we ever sent. I will try to get this explain. But the sequence reset problem remains valid to me.

8=FIX.4.39=8335=549=DEF56=ABC34=288152=20080108-22:04:5958=Scheduled logout10=107
8=FIX.4.39=6735=534=266549=ABC52=20080108-22:04:59.99656=DEF10=077
8=FIX.4.39=7935=A34=266649=ABC52=20080108-22:05:00.71556=DEF98=0108=3010=094
8=FIX.4.39=7935=A34=266749=ABC52=20080108-22:06:01.71556=DEF98=0108=3010=097
8=FIX.4.39=7935=A34=266849=ABC52=20080108-22:07:02.73156=DEF98=0108=3010=098
8=FIX.4.39=7935=A34=266949=ABC52=20080108-22:08:02.73156=DEF98=0108=3010=100
8=FIX.4.39=7935=A34=267049=ABC52=20080108-22:09:03.74756=DEF98=0108=3010=101
8=FIX.4.39=7935=A34=267149=ABC52=20080108-22:10:03.76356=DEF98=0108=3010=092
8=FIX.4.39=7235=A49=DEF56=ABC34=152=20080108-22:10:04108=3098=010=233
8=FIX.4.39=11935=534=267249=ABC52=20080108-22:10:04.92056=DEF58=MsgSeqNum too low, expecting 2882 but received 110=152
8=FIX.4.39=7935=A34=267349=ABC52=20080108-22:11:04.76456=DEF98=0108=3010=097
8=FIX.4.39=7235=A49=DEF56=ABC34=352=20080108-22:11:05108=3098=010=237
8=FIX.4.39=11935=534=267449=ABC52=20080108-22:11:05.48356=DEF58=MsgSeqNum too low, expecting 2882 but received 310=162
8=FIX.4.39=7935=A34=267549=ABC52=20080108-22:12:05.78056=DEF98=0108=3010=099
8=FIX.4.39=7235=A49=DEF56=ABC34=552=20080108-22:12:05108=3098=010=240
8=FIX.4.39=11935=534=267649=ABC52=20080108-22:12:06.46756=DEF58=MsgSeqNum too low, expecting 2882 but received 510=170

... around an hour later ...

8=FIX.4.39=12135=534=281249=ABC52=20080108-23:16:45.88356=DEF58=MsgSeqNum too low, expecting 2882 but received 13510=009
8=FIX.4.39=8235=A34=149=ABC52=20080108-23:17:45.77456=DEF98=0108=30141=Y10=244
8=FIX.4.39=7835=A49=DEF56=ABC34=152=20080108-23:17:46108=30141=Y98=010=042

Comment by André Malenfant [ 09/Jan/08 ]

Here is a working logout/login sequence coming from the same app with the same config:

8=FIX.4.39=8235=549=DEF56=ABC34=37052=20071120-22:04:5958=Scheduled logout10=043
8=FIX.4.39=6635=534=37049=ABC52=20071120-22:05:00.04256=DEF10=238
8=FIX.4.39=8235=A34=149=ABC52=20071120-22:05:00.66756=DEF98=0108=30141=Y10=226
8=FIX.4.39=8235=A34=149=ABC52=20071120-22:06:01.68356=DEF98=0108=30141=Y10=226
8=FIX.4.39=8235=A34=149=ABC52=20071120-22:07:02.70056=DEF98=0108=30141=Y10=218
8=FIX.4.39=8235=A34=149=ABC52=20071120-22:08:02.71656=DEF98=0108=30141=Y10=226
8=FIX.4.39=8235=A34=149=ABC52=20071120-22:09:03.71756=DEF98=0108=30141=Y10=229
8=FIX.4.39=8235=A34=149=ABC52=20071120-22:10:04.73356=DEF98=0108=30141=Y10=220
8=FIX.4.39=7835=A49=DEF56=ABC34=152=20071120-22:10:05108=30141=Y98=010=023

Comment by Steve Bate [ 09/Jan/08 ]

First, I assume that you are using QFJ 1.2.1, given the "affects version". Please tell me if that's not correct.

For future comments, please replace the SOH character in the FIX message log records with a delimiter (like '|'). The delimeters are currently being lost and it makes it much more difficult and time consuming to interpret the log entries.

The SendResetSeqNumFlag has no effect in QFJ 1.2.1. It remained in the documentation from an earlier version but it should not have been in the 1.2.1 documentation. I didn't mention this earlier because you were also using ResetOnDisconnect and I didn't want to confuse the issue. I recommend you remove SendResetSeqNumFlag from your configuration file to avoid any further confusion.

You also did not give me the timeline of logon messages and connection events. The logs of logon messages alone do not provide enough information to diagnose the problem since I don't know when the connection is established and broken. For example, I'm not sure if there is a disconnect between
each logon attempt or the connection is established and the counterparty is just not responding for some reason which would cause the logons to timeout. By not providing the relevant events from the event log (again, a merged timeline of messages and events would be most helpful) it makes it much more difficult for me to help you effectively.

Please keep in mind that free support on QFJ depends on the availability of my spare time. There are organizations that can provide commercial support for QFJ and they are listed on the web site. A few of these are in Europe.

Generated at Wed Nov 27 02:14:16 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.