[QFJ-968] SessionState messageQueue causing out of memory exception Created: 06/Jan/19  Updated: 11/Jan/19

Status: Open
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 2.0.0
Fix Version/s: None

Type: Bug Priority: Default
Reporter: Ryan Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Java 8



 Description   

This issue is based on my own observation and investigation. Please correct me if I have misunderstood anything, thanks!

Scenario:

  • We use SocketInitiator to connect to a vendor server and consume data.
  • ResendRequestChunkSize is set to 5000.
    Now assume our process is down for several hours and 10 million messages queue up on the vendor side. Once we reconnect, we will send about 2,000 resend requests with a batch size of 5000 to catch up to the latest message. Between every two batches, the vendor sends us its latest messages, which have much higher sequence numbers and are added to the messageQueue in SessionState. They will not be processed until all 10 million messages have been received, so memory keeps growing during the "catching up" period and will eventually cause an out-of-memory error.

Proposed Solution:

Leaving the messageQueue unbounded could be risky. My understanding is that the messageQueue should only hold temporarily out-of-order messages, and if it ever contains, say, 100,000 messages, something must be wrong. Therefore an option should be provided to discard newer messages instead of continuing to eat up memory.

My proposal is to:

  1. Change the LinkedHashMap to a TreeMap. Even though this may slow things down a little, O(log n) operations should be fine given that the messageQueue is normally very small. And when it does grow, having entries ordered by sequence number should be beneficial.
  2. Update the enqueue method to limit the size of the messageQueue. If the queue is full, add the new message and then remove the entry with the highest sequence number.
  3. Update the dequeueMessagesUpTo method accordingly.
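The three steps above could be sketched as follows. This is a hypothetical illustration of the proposal, not actual QuickFIX/J code: the real SessionState keys quickfix.Message objects by MsgSeqNum, whereas String stands in here, and MAX_QUEUE_SIZE is an illustrative limit, not an existing QFJ setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the proposed bounded, seqnum-ordered message queue.
class BoundedMessageQueue {
    static final int MAX_QUEUE_SIZE = 4; // illustrative; would be configurable

    // Proposal item 1: TreeMap keeps entries ordered by sequence number
    // (O(log n) per operation).
    private final TreeMap<Integer, String> messageQueue = new TreeMap<>();

    // Proposal item 2: if the queue overflows, drop the entry with the
    // highest sequence number so earlier messages are always preferred.
    void enqueue(int seqNum, String message) {
        messageQueue.put(seqNum, message);
        if (messageQueue.size() > MAX_QUEUE_SIZE) {
            messageQueue.pollLastEntry();
        }
    }

    // Proposal item 3: drain queued messages below the given bound, in order.
    List<String> dequeueMessagesUpTo(int seqNum) {
        List<String> drained = new ArrayList<>();
        Map.Entry<Integer, String> first;
        while ((first = messageQueue.firstEntry()) != null && first.getKey() < seqNum) {
            drained.add(messageQueue.pollFirstEntry().getValue());
        }
        return drained;
    }

    int size() {
        return messageQueue.size();
    }
}
```

With this shape, a message discarded by the eviction step is simply re-requested later, as discussed in the comments below the description.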

If you think this could work, I would be happy to submit a PR, thanks!



 Comments   
Comment by Christoph John [ 08/Jan/19 ]

I am not sure I fully understand: if you want to limit the queue size and discard messages when the queue is full, then why do a resend request for that many messages at all? You could simply skip the resends (e.g. reset to seqnum 1)?

Apart from that: if you are transmitting millions of messages in a few hours then I am not sure if FIX is the right choice of protocol. I assume that some kind of volatile data is transmitted (e.g. market data) which in turn leads to the question why you need resends at all if the data is outdated in the meantime anyway.

Or did I misunderstand something?

Thanks,
Chris.

Comment by Ryan [ 09/Jan/19 ]

Hi Chris - Thank you for the reply! Let me try to give more details.

1. The reason we want to persist these messages is for future analysis. Normally the system works perfectly fine, but we want to handle the rare situation where there is an issue and the system goes down in the middle of the day. When that happens, we do need to send resend requests to catch up on the missed data.

2. My understanding is that there are two queues involved:

  • eventQueue in Session which is a bounded blocking queue
  • messageQueue in SessionState which is a map and not bounded

Every message has to go through the eventQueue, but only messages with a higher seqNum than expected are put into the messageQueue. Let's assume the batch size is 20 and consider the following event sequence:

  • initiator -> acceptor: expect 501, but seeing 10000, request resending of 501-520
  • acceptor -> initiator: resend 501-520, done
  • acceptor -> initiator: new messages 10001, 10002, 10003
  • initiator -> acceptor: request resending of 521-540
  • acceptor -> initiator: resend 521-540, done
  • acceptor -> initiator: new messages 10004, 10005
    ...

In the given sequence, messages 501-540 will be handled correctly, but 10001-10005 will be added to the messageQueue and will not be touched until all of messages 501-10000 have been requested and processed.

My proposal is to limit the size of the messageQueue (the map) to prevent unbounded growth. For example, if we set the limit to 4 messages, then message 10005 and later messages will not be added to the messageQueue. When the initiator finishes handling messages 501-10004, it will expect 10005; since that message was discarded, the initiator will request messages 10005 to 10024 via a new resend request and move on.
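The fallback described here could be sketched as below. Names and structure are assumptions for illustration, not actual QFJ code: after the replay, if the next expected MsgSeqNum was discarded from the queue, the initiator simply issues another chunked resend request.

```java
import java.util.TreeMap;

// Sketch of the gap-fill fallback after queued messages were discarded.
class ResendFallbackSketch {
    static final int CHUNK_SIZE = 20; // stands in for ResendRequestChunkSize

    // Returns the [beginSeqNo, endSeqNo] range to re-request, or null when
    // the expected message is already queued and no request is needed.
    static int[] nextResendRange(TreeMap<Integer, String> queue, int expectedSeqNum) {
        if (queue.containsKey(expectedSeqNum)) {
            return null;
        }
        return new int[] { expectedSeqNum, expectedSeqNum + CHUNK_SIZE - 1 };
    }
}
```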

This scenario will happen only when both of the following conditions are satisfied:

  • a resend batch size (ResendRequestChunkSize) is configured.
  • new messages arrive while the acceptor is resending earlier ones.

Hope this explains the scenario better.

Thanks,
Ryan

Comment by Ryan [ 09/Jan/19 ]

BTW this is not a blocking issue for us, just want to report what we've observed during the development.

Comment by Christoph John [ 11/Jan/19 ]

Hi Ryan,

after thinking about it a little more, I think it should work the way you said: if we limit the number of queued messages, QFJ will request the discarded ones anyway as soon as it has dequeued all queued messages.

I have the following questions:
1. Why should this only be active when a resend batch size is set? IMHO new messages could also be interleaved with resent messages without batching. But of course that depends on how the counterparty has implemented its resend processing.
2. Why do you want to change the LinkedHashMap to a TreeMap? Just because of the size limitation? Wouldn't a simple check whether the allowed size has been reached (and then not queueing the message) be sufficient?
3. In the description you said "If the queue is full, then add the new message and remove the entry with highest sequence number." Why? I thought the idea was to not add the message?

Thanks,
Chris.

Comment by Christoph John [ 11/Jan/19 ]

Of course it will lead to a massive number of messages, since messages will be re-requested multiple times if the counterparty has already sent them and we re-request them because we did not put them into the queue.

Comment by Ryan [ 11/Jan/19 ]

Hi Chris,

Let me pull the example sequence from above and try to explain my understanding:

  • initiator -> acceptor: expect 501, but seeing 10000, request resending of 501-520
  • acceptor -> initiator: resend 501-520, done
  • acceptor -> initiator: new messages 10001, 10002, 10003
  • initiator -> acceptor: request resending of 521-540
  • acceptor -> initiator: resend 521-540, done
  • acceptor -> initiator: new messages 10004, 10005
  • ...
    1. Totally agree that the limit can and should apply in both the batch and non-batch cases. Normally a non-batch request won't cause this problem because of the acceptor's logic. Taking the same example, if instead of using a batch of 20 we send the resend request as 501-infinity, then messages 501-10000 will be added to the acceptor's send buffer. If new messages 10001-10005 appear on the acceptor side, they will only be appended to the end of that buffer, so the initiator will not see them until 501-10000 have been received and processed. But if the acceptor implements resend handling differently, the non-batch case may hit the same memory-expansion problem. So I think you are right that we should limit the size of the messageQueue in all cases.
    2 & 3. Generally speaking, we care more about the message sequence (MsgSeqNum) than about the insertion order. More specifically, assume the messageQueue is full with messages 10001-10004 and a temporarily out-of-order message 542 comes in. Since we care more about 542 than about 10004 at this point, evicting 10004 is the only option if we want to keep 542 without growing the queue. This way the TreeMap-based messageQueue still works even after it is full.
    4. If a size limit is added, it should definitely be configurable, so that users can choose between memory expansion and a massive number of resend requests.
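The eviction choice in point 2 & 3 could be illustrated like this (names are illustrative, not actual QFJ code): with a full queue of 10001-10004 and an out-of-order message 542 arriving, inserting 542 and then evicting the highest sequence number keeps the earlier, more urgent message.

```java
import java.util.TreeMap;

// Demonstrates insert-then-evict-highest on a seqnum-ordered map.
class EvictionDemo {
    // Inserts the message, evicts the highest-seqnum entry, and returns the
    // evicted sequence number.
    static int insertAndEvictHighest(TreeMap<Integer, String> queue, int seqNum, String msg) {
        queue.put(seqNum, msg);
        return queue.pollLastEntry().getKey();
    }
}
```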

Thanks,
Ryan

Generated at Tue Nov 05 04:42:28 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.