[QFJ-74] Explicitly control the String encoding of the FIX messages Created: 22/Sep/06 Updated: 09/Jun/14 Resolved: 25/May/07 |
Status: | Closed |
Project: | QuickFIX/J |
Component/s: | None |
Affects Version/s: | 1.0.0 Final, 1.0.1, 1.0.2, 1.0.3 |
Fix Version/s: | 1.2.0 |
Type: | Bug | Priority: | Major |
Reporter: | Steve Bate | Assignee: | Steve Bate |
Resolution: | Fixed | Votes: | 1 |
Labels: | None |
Attachments: | ChecksumTest.java |
Issue Links: |
Description |
This is currently an issue both with multibyte characters in FIX messages (somewhat rare, but it has been requested by a user) and with checksum problems caused by unintentional nonprinting characters in messages. The default encoding differs by platform, so it is not a good idea to use the implicit default encoding. |
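The platform-dependence mentioned here can be illustrated directly: `String.getBytes()` with no argument uses the platform default charset, so the same message can produce different bytes on different hosts, while an explicit charset is stable everywhere. A minimal sketch (illustrative only, not QuickFIX/J code; the field value is assumed):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingPitfall {
    // Encodes with an explicit charset, immune to the platform default.
    static byte[] explicitBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String field = "48=Sch\u00f6n";  // a field value containing a German umlaut
        // Under US-ASCII the unmappable 'ö' is replaced by '?', so the wire
        // bytes differ from the Latin-1 encoding of the same String.
        byte[] ascii = field.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = explicitBytes(field);
        System.out.println(Arrays.equals(ascii, latin1));  // false
    }
}
```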
Comments |
Comment by Steve Bate [ 22/Sep/06 ] |
We'll need to modify both the FIX decoder and encoder. |
Comment by Brad Harvey [ 23/Sep/06 ] |
I think the checksum validation/calculation should be in the decoder/encoder on the bytes just received/about to be sent. This makes it possible to correctly calculate the checksum on messages you receive which contain byte sequences that can't be mapped to your charset. |
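For reference, the FIX CheckSum(10) field is defined over the raw transmitted bytes: the sum of all message bytes up to the checksum field, modulo 256. Computing it on bytes, as suggested here, sidesteps charset issues entirely. A minimal sketch (not the QuickFIX/J implementation):

```java
public class FixChecksum {
    // FIX checksum: sum of all message bytes (everything before the
    // "10=" field) modulo 256, with bytes treated as unsigned.
    static int checksum(byte[] msg) {
        int sum = 0;
        for (byte b : msg) {
            sum += b & 0xFF;  // mask to get the unsigned value
        }
        return sum & 0xFF;    // modulo 256
    }

    public static void main(String[] args) throws Exception {
        // SOH (0x01) is the FIX field delimiter.
        byte[] body = "8=FIX.4.2\u00019=5\u000135=0\u0001".getBytes("ISO-8859-1");
        System.out.printf("10=%03d%n", checksum(body));
    }
}
```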
Comment by Brad Harvey [ 07/Oct/06 ] |
I've attached some tests I was playing with to help me understand my checksum problem. I was surprised by the result of testDecodeDoubleByte (it passes in trunk), so I thought it was worth sharing. Most of the other tests aren't as interesting.

It seems that the change to FIXMessageDecoder.getMessageString to use the new String(byte[], charsetName) constructor has fixed the issue I was seeing with double bytes causing checksum failures in 1.0.2. From the javadoc: "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required."

It seems that the "unspecified" behaviour is to allow the invalid bytes through unchanged. What I found surprising was that there doesn't seem to be a way to emulate this behaviour using the Ignore/Replace/Report mechanism of CharsetDecoder. So hopefully the unspecified behaviour is consistent across JVMs (and alternate charsets?)! |
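The CharsetDecoder mechanism referred to in that javadoc looks like the sketch below: with CodingErrorAction.REPORT, invalid bytes raise an exception instead of passing through, which is indeed different from the observed constructor behaviour. Illustrative only, not QuickFIX/J code:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class DecoderBehavior {
    // Strict decode: fails on bytes that are invalid in the charset,
    // rather than ignoring or replacing them.
    static String strictDecode(byte[] bytes, String charsetName)
            throws CharacterCodingException {
        return Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] valid = {0x41, 0x42};           // "AB" in US-ASCII
        System.out.println(strictDecode(valid, "US-ASCII"));
        byte[] invalid = {0x41, (byte) 0xF6};  // 0xF6 is not valid US-ASCII
        try {
            strictDecode(invalid, "US-ASCII");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);  // REPORT raises instead of replacing
        }
    }
}
```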
Comment by Steve Bate [ 10/Nov/06 ] |
I'm moving this to a post-1.1 release. I can see what needs to be done, but the trick will be finding a way to do it that doesn't impact performance too negatively. The comments and tests have addressed the checksum and related unsigned-byte issues, but there are also issues with message length calculations when using multibyte character sets. |
Comment by Brad Harvey [ 10/Nov/06 ] |
When I looked at it, I actually wondered if two options might be needed: the "old" one (which so far seems to do the job for most people) and a new one that handles double-byte chars but may be slower. Having said that, I'd imagine the choice of MessageStore would have a bigger performance impact. |
Comment by Steve Bate [ 10/Nov/06 ] |
That's basically the approach I've been taking. One issue is that with the way the body length and checksum are currently calculated, the specified charset would have to be pushed down to the field level, and each field would need to do a character encoding to a byte array to determine its length in bytes and its contribution to the checksum. If I do that, and I have a conditional to provide the current behavior when the default charset (US-ASCII) is being used, then it might be reasonable. |
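The per-field cost described here comes from the fact that a field's byte length only matches its String length for single-byte charsets. A quick illustration of the divergence under a multibyte charset (the field value is assumed; not QuickFIX/J code):

```java
import java.io.UnsupportedEncodingException;

public class FieldByteLength {
    // A field's contribution to BodyLength(9) is its encoded byte count,
    // not its Java char count.
    static int byteLength(String fieldValue, String charsetName)
            throws UnsupportedEncodingException {
        return fieldValue.getBytes(charsetName).length;
    }

    public static void main(String[] args) throws Exception {
        String value = "M\u00fcller";  // "Müller"
        System.out.println(value.length());                   // 6 chars
        System.out.println(byteLength(value, "ISO_8859-1"));  // 6 bytes: 1:1 mapping
        System.out.println(byteLength(value, "UTF-8"));       // 7 bytes: 'ü' takes 2
    }
}
```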
Comment by Jörg Thönnes [ 21/Mar/07 ] |
We have issues with German and Italian character sets. These character sets use characters outside the US-ASCII range. After the log line for the incoming message, the error displayed is: Mar 21, 2007 12:01:05 PM quickfix.mina.AbstractIoHandler messageReceived […] At the moment, we have no way to check automatically for this error. Is there any way to do so? |
Comment by Jörg Thönnes [ 31/Mar/07 ] |
Looking at some CharsetDecoder examples, I am wondering whether setting the encoding/decoding charset to Latin-1 would help. |
Comment by Jörg Thönnes [ 31/Mar/07 ] |
In the code, I found two places where the charset name "US-ASCII" is used. To check the conversion of umlauts, I used the following snippet:

    //String charSet = "US-ASCII";
    […]

The output differs between ISO_8859-1 and US-ASCII. So in our case, setting the charset name to "ISO_8859-1" would help. That is, we need a charset that is configurable globally or per session. |
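The snippet itself did not survive in full; a hedged reconstruction of an umlaut round-trip check of this kind (variable names and test string assumed) might look like:

```java
import java.io.UnsupportedEncodingException;

public class UmlautCheck {
    // Round-trips a string through the given charset; characters the
    // charset cannot represent come back changed (e.g. as '?').
    static String roundTrip(String s, String charSet)
            throws UnsupportedEncodingException {
        return new String(s.getBytes(charSet), charSet);
    }

    public static void main(String[] args) throws Exception {
        //String charSet = "US-ASCII";  // umlauts are lost with this setting
        String charSet = "ISO_8859-1";  // umlauts survive
        String umlauts = "\u00e4\u00f6\u00fc";  // "äöü"
        System.out.println(umlauts.equals(roundTrip(umlauts, charSet)));
    }
}
```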
Comment by Steve Bate [ 02/Apr/07 ] |
Is the US-ASCII you saw used in the Multibyte branch? I couldn't find it in the trunk and I don't currently have a branch workspace checked out. |
Comment by Jörg Thönnes [ 02/Apr/07 ] |
FIXMessageDecoder.java, SVN 555:

    public FIXMessageDecoder() {
        this("US-ASCII");
    }

FIXMessageEncoder.java, SVN 555: […] |
Comment by Jörg Thönnes [ 11/Apr/07 ] |
This little Java program shows the current default charset:

    import java.nio.charset.Charset;

    public class ShowDefaultCharset {
        public static void main(String[] args) {
            System.out.println(Charset.defaultCharset());
        }
    }

On Linux, the output depends on the setting of the LANG environment variable:

    LANG=POSIX java -cp . ShowDefaultCharset
    LANG=en_US.iso88591 java -cp . ShowDefaultCharset
    LANG=de_DE.utf8 java -cp . ShowDefaultCharset

Since the ISO-8859-1 character set covers the whole 8 bits of a byte, it should work well for most single-byte non-ASCII charsets. |
Comment by Jörg Thönnes [ 11/Apr/07 ] |
Here is an extended Java program which also checks the byte-to-character mappings:

    import java.io.UnsupportedEncodingException;

    public class ShowDefaultCharset {
        public static void main(String[] args) throws UnsupportedEncodingException {
            final String charSet = "ISO_8859-1";
            byte[] b = new byte[256];
            for (int i = 0; i < 256; i++)
                b[i] = (byte) i;
            final String x = new String(b, charSet);
            for (int i = 0; i < 256; i++)
                if (x.charAt(i) != i)
                    System.out.println(i + " -> " + (int) x.charAt(i));
        }
    }

Applying this program to ISO_8859-1 shows no change, i.e. this charset is really 1:1. This means that the default character set for QuickFIX/J should be ISO_8859-1. For multi-byte character sets, extra effort has to be made. |
Comment by Jörg Thönnes [ 11/Apr/07 ] |
Another extension checks the reverse direction. Since Java bytes are signed, I add 256 and take the value modulo 256 to get the positive value:

    import java.io.UnsupportedEncodingException;

    public class ShowDefaultCharset {
        public static void main(String[] args) throws UnsupportedEncodingException {
            final String charSet = "ISO_8859-1";
            byte[] b = new byte[256];
            for (int i = 0; i < 256; i++)
                b[i] = (byte) i;
            final String x = new String(b, charSet);
            final byte[] bx = x.getBytes(charSet);
            for (int i = 0; i < 256; i++) {
                final int bb = (bx[i] + 256) % 256;
                if (bb != i)
                    System.out.println(i + " -> " + bb);
            }
        }
    }

For ISO_8859-1, this works fine. |
Comment by Jörg Thönnes [ 11/Apr/07 ] |
I suggest applying this patch to the 1.1.0 release to make the FIXMessageEncoder equivalent to the FIXMessageDecoder:

    Index: /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java
     package quickfix.mina.message;
    +import java.io.UnsupportedEncodingException;
    @@ -50,6 +51,8 @@
    +    private String charsetName = "ISO_8859-1";
         ByteBuffer buffer = ByteBuffer.allocate(fixMessageString.length());
Steve, what do you think? |
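A sketch of what the patched encode path might amount to (heavily simplified; the real FIXMessageEncoder is a MINA ProtocolEncoder writing into a MINA ByteBuffer, which is omitted here, and the method name is assumed):

```java
import java.io.UnsupportedEncodingException;

public class EncoderSketch {
    // Mirrors the patch: a configurable charset instead of the implicit default.
    private String charsetName = "ISO_8859-1";

    // Converts the assembled FIX message string to wire bytes. For
    // ISO_8859-1 the byte count equals fixMessageString.length(), so a
    // buffer allocated from the String length stays correct.
    byte[] encode(String fixMessageString) throws UnsupportedEncodingException {
        return fixMessageString.getBytes(charsetName);
    }

    public static void main(String[] args) throws Exception {
        byte[] wire = new EncoderSketch().encode("35=0\u000158=Gr\u00fc\u00dfe\u0001");
        System.out.println(wire.length);
    }
}
```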
Comment by Jörg Thönnes [ 11/Apr/07 ] |
The suggested patch would complement the changes made with revision 527 in the FIXMessageDecoder. |
Comment by Jörg Thönnes [ 12/Apr/07 ] |
I would also remove the setCharset methods in both FIXMessageEncoder/Decoder, since only the ISO_8859-1 charset currently works correctly. To allow other charsets, more work has to be done in the validation method. |
Comment by Steve Bate [ 12/Apr/07 ] |
I originally was using US-ASCII (in the branch) because the FIX specification requires ASCII for non-encoded fields. However, I have no problem supporting other character sets (single-byte, for now). Is there any reason why we wouldn't use UTF-8 instead of ISO_8859-1? |
Comment by Jörg Thönnes [ 12/Apr/07 ] |
Because ISO_8859-1 is the only charset with a 1:1 mapping. I tried US-ASCII, UTF-8 and ISO_8859-15, and each of these charsets maps some of the bytes differently. Therefore, exposing the setCharset() method makes sense as soon as the validate() method computes the checksums on the plain bytes.

My idea is to have a factory (FramingStrategyFactory) which returns a FramingStrategy for a given charset. The FramingStrategy for ISO_8859-1 could simply take the String length and operate on the Java String directly, while single-byte strategies could take the String length but compute the checksum on the bytes, and multi-byte strategies would also compute the length on the bytes.

The FIX Message constructor would have an optional charset argument, and when the message is sent down the link, a check is made whether the encoder charset is compatible with the Message charset. If not, the checksum and possibly the length are recomputed. But I would promote this FramingStrategy stuff to a new JIRA issue. |
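The proposed design could be sketched roughly as follows. Only FramingStrategyFactory is named in the comment above; the interface methods and strategy class names are assumptions, and this is not QuickFIX/J code:

```java
import java.io.UnsupportedEncodingException;

// Computes the BodyLength(9) and CheckSum(10) inputs for a charset family.
interface FramingStrategy {
    int bodyLength(String body) throws UnsupportedEncodingException;
    int checksum(String body) throws UnsupportedEncodingException;
}

// ISO_8859-1 is 1:1, so the String length is already the byte length
// and each char value equals the unsigned byte value.
class Latin1FramingStrategy implements FramingStrategy {
    public int bodyLength(String body) {
        return body.length();
    }
    public int checksum(String body) {
        int sum = 0;
        for (int i = 0; i < body.length(); i++) {
            sum += body.charAt(i);
        }
        return sum & 0xFF;
    }
}

// Multi-byte charsets must measure and sum the encoded bytes.
class MultiByteFramingStrategy implements FramingStrategy {
    private final String charsetName;
    MultiByteFramingStrategy(String charsetName) { this.charsetName = charsetName; }
    public int bodyLength(String body) throws UnsupportedEncodingException {
        return body.getBytes(charsetName).length;
    }
    public int checksum(String body) throws UnsupportedEncodingException {
        int sum = 0;
        for (byte b : body.getBytes(charsetName)) {
            sum += b & 0xFF;
        }
        return sum & 0xFF;
    }
}

public class FramingStrategyFactory {
    // Returns a strategy for the given charset (simplified dispatch).
    static FramingStrategy forCharset(String charsetName) {
        return "ISO_8859-1".equals(charsetName)
                ? new Latin1FramingStrategy()
                : new MultiByteFramingStrategy(charsetName);
    }

    public static void main(String[] args) throws Exception {
        String body = "35=0\u000158=\u00e4\u0001";  // contains 'ä'
        System.out.println(forCharset("ISO_8859-1").bodyLength(body));  // 10
        System.out.println(forCharset("UTF-8").bodyLength(body));       // 11: 'ä' takes 2 bytes
    }
}
```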
Comment by Jörg Thönnes [ 12/Apr/07 ] |
Today I noticed that outgoing messages with national characters are encoded wrongly on the first attempt, but go through correctly after a resend. |
Comment by Jörg Thönnes [ 19/Apr/07 ] |
OK, for outgoing messages, it currently works as follows:
1. Without PossDup=Y: The checksum is computed on the String directly and then forwarded to the encoder. The checksum is wrong.
2. Inside a ResendRequest: The String is retrieved from the MessageStore, where it has been saved as a byte[] array. In this way, the "bad" characters seem to be preserved.
In summary, non-ASCII characters cause exactly one resend round-trip for outgoing messages. |
Comment by Steve Bate [ 25/May/07 ] |
I checked in changes to allow the message encoding to be set on a JVM-wide basis. See the CharsetSupport class and its uses for the specific changes. I apologize that I forgot to add the issue tag to the commit, so the SVN changes aren't linked to this issue. |
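The pattern described is a JVM-wide static setting validated at assignment time. A minimal sketch of such a holder follows; this is a simplified stand-in with assumed names, not the actual quickfix.CharsetSupport source:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class CharsetHolder {
    private static String charset = "ISO-8859-1";  // assumed default

    // JVM-wide setter; rejects charset names the runtime does not support,
    // so a bad configuration fails at startup rather than mid-session.
    public static void setCharset(String charsetName)
            throws UnsupportedEncodingException {
        if (!Charset.isSupported(charsetName)) {
            throw new UnsupportedEncodingException(charsetName);
        }
        charset = charsetName;
    }

    public static String getCharset() {
        return charset;
    }

    public static void main(String[] args) throws Exception {
        setCharset("ISO-8859-1");
        System.out.println(getCharset());
    }
}
```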
Comment by amichair [ 09/Jun/14 ] |
Following resolution of |