[QFJ-382] Foreign Language Support - Multibyte Characters - Chinese Created: 09/Dec/08  Updated: 02/Nov/15  Resolved: 09/Jun/14

Status: Closed
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.3.3
Fix Version/s: 1.6.0

Type: Improvement Priority: Default
Reporter: Jason Aubrey Assignee: amichair
Resolution: Fixed Votes: 3
Labels: encoding
Environment:

All


Attachments: Zip Archive Changes.zip    
Issue Links:
Duplicate
duplicates QFJ-38 FIX Message support double-byte charset. Closed
is duplicated by QFJ-666 FIXMessageEncoder got BufferOverflowE... Closed
Relates
relates to QFJ-789 Fully support alternate encodings (ch... Open
is related to QFJ-631 Wrong checksum calculation in "quickf... Closed
is related to QFJ-282 FIXMessageEncoder#encode() may throws... Closed

 Description   

I need QFJ to support Chinese characters, so I modified my working copy to add this functionality along with tests. I would simply commit the changes, but I don't have write access to the repository, so I'll post the relevant changes here for now. It would be nice if I could add all the diffs as attachments to this message.

Message.java
<pre>
 public String toString() {
-    header.setField(new BodyLength(bodyLength()));
+    try {
+        header.setField(new BodyLength(bodyLength()));
+    } catch (UnsupportedEncodingException e) {
+        LoggerFactory.getLogger(getClass()).error("toString failed, unsupported encoding", e);
+        return "";
+    }
     trailer.setField(new CheckSum(checkSum()));
     StringBuffer sb = new StringBuffer();
@@ -138,7 +145,7 @@
     return sb.toString();
 }

-public int bodyLength() {
+public int bodyLength() throws UnsupportedEncodingException {
     return header.calculateLength() + calculateLength() + trailer.calculateLength();
 }
</pre>
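(Side note, not part of the patch: BodyLength in FIX is defined as a byte count of the message body, so once field values can contain multibyte characters it has to be derived from the encoded bytes rather than from String lengths. That is why bodyLength() now propagates the UnsupportedEncodingException coming from the length calculation shown next.)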

Field.java
<pre>
-/*package*/ int getLength() {
+/*package*/ int getLength() throws UnsupportedEncodingException {
     calculate();
-    return data.length()+1;
+    return data.getBytes(CharsetSupport.getCharset()).length+1;
 }
</pre>
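To make the motivation concrete, here is a tiny standalone sketch (my own illustration, not part of the patch) showing how the character count and the encoded byte count diverge for the Chinese string used in the tests below:

<pre>
public class LengthDemo {
    public static void main(String[] args) throws Exception {
        String value = "\u6D4B\u9A8C\u6570\u636E"; // "test data" in Chinese
        System.out.println(value.length());                 // 4 UTF-16 chars
        System.out.println(value.getBytes("UTF-8").length); // 12 bytes on the wire (3 bytes per character)
    }
}
</pre>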

FieldTest.java
<pre>
-public void testFieldCalculations() {
+public void testFieldCalculationsEnglish() throws Exception {
     Field<String> object = new Field<String>(12, "VALUE");
     object.setObject("VALUE");
     assertEquals("12=VALUE", object.toString());
@@ -63,6 +65,22 @@
     assertEquals(544, object.getTotal());
     assertEquals(9, object.getLength());
 }
+
+public void testFieldCalculationsChinese() throws Exception {
+    try {
+        CharsetSupport.setCharset("UTF-8");
+        int tag = 13;
+        String value = "\u6D4B\u9A8C\u6570\u636E";
+        Field<String> object = new Field<String>(tag, value);
+        assertEquals(tag + "=" + value, object.toString());
+        assertEquals(119127, object.getTotal());
+        assertEquals(16, object.getLength());
+    } catch (Exception e) {
+        throw e;
+    } finally {
+        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+    }
+}
</pre>
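A quick sanity check on the expected values (my own arithmetic, not part of the patch): "13=" plus the four Chinese characters is 3 + 4×3 = 15 bytes in UTF-8, and getLength() adds one for the trailing SOH delimiter, giving 16. getTotal(), on the other hand, still sums UTF-16 char values: '1' + '3' + '=' (161) plus 0x6D4B + 0x9A8C + 0x6570 + 0x636E (118,965) plus the SOH (1) gives 119,127. In other words, the length in this patch is byte-based while the checksum total remains character-based, which is essentially the concern raised in the comments below.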

FIXMessageEncoderTest.java
<pre>
 public void testWesternEuropeanEncoding() throws Exception {
-    // Default encoding, should work
-    doEncodingTest();
-
-    try {
-        // This will break because of European characters
-        CharsetSupport.setCharset("US-ASCII");
-        doEncodingTest();
-    } catch (ComparisonFailure e) {
-        // expected
-    } finally {
-        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
-    }
+    // äbcfödçé
+    String input = "\u00E4bcf\u00F6d\u00E7\u00E9";
+
+    // Default encoding, should work
+    doEncodingTest(input);
+
+    try {
+        // This will break because of European characters
+        CharsetSupport.setCharset("US-ASCII");
+        doEncodingTest(input);
+    } catch (ComparisonFailure e) {
+        // expected
+    } finally {
+        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+    }
 }

-private void doEncodingTest() throws ProtocolCodecException, UnsupportedEncodingException {
-    // äbcfödçé
-    String headline = "\u00E4bcf\u00F6d\u00E7\u00E9";
+public void testChineseEncoding() throws Exception {
+    // "test data" in Chinese
+    String input = "\u6D4B\u9A8C\u6570\u636E";
+
+    try {
+        // This will break because the characters cannot be represented properly
+        doEncodingTest(input);
+    } catch (ComparisonFailure e) {
+        // expected
+    }
+
+    try {
+        // This should work
+        CharsetSupport.setCharset("UTF-8");
+        doEncodingTest(input);
+    } finally {
+        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+    }
+}
+
+private void doEncodingTest(String input) throws ProtocolCodecException, UnsupportedEncodingException {
     News news = new News();
-    news.set(new Headline(headline));
+    news.set(new Headline(input));
     FIXMessageEncoder encoder = new FIXMessageEncoder();
     ProtocolEncoderOutputForTest encoderOut = new ProtocolEncoderOutputForTest();
     encoder.encode(null, news, encoderOut);
@@ -84,11 +105,24 @@
 }

-public void testEncodingString() throws Exception {
+public void testEncodingStringEnglish() throws Exception {
     FIXMessageEncoder encoder = new FIXMessageEncoder();
     ProtocolEncoderOutputForTest protocolEncoderOutputForTest = new ProtocolEncoderOutputForTest();
     encoder.encode(null, "abcd", protocolEncoderOutputForTest);
     assertEquals(4, protocolEncoderOutputForTest.buffer.limit());
 }
+
+public void testEncodingStringChinese() throws Exception {
+    FIXMessageEncoder encoder = new FIXMessageEncoder();
+    ProtocolEncoderOutputForTest protocolEncoderOutputForTest = new ProtocolEncoderOutputForTest();
+
+    try {
+        CharsetSupport.setCharset("UTF-8");
+        encoder.encode(null, "\u6D4B\u9A8C\u6570\u636E", protocolEncoderOutputForTest);
+    } finally {
+        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+    }
+    assertEquals(12, protocolEncoderOutputForTest.buffer.limit());
+}
 }
</pre>
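(For reference, again my own note rather than part of the patch: the four Chinese characters occupy three bytes each in UTF-8, so the encoded string is 12 bytes, which is what the final assertEquals checks. Sizing buffers by character count instead of byte count is also the likely source of the BufferOverflowException tracked in QFJ-666.)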



 Comments   
Comment by Jason Aubrey [ 09/Dec/08 ]

The revision number of my working copy is 892 (was head revision last week at least).

Comment by Steve Bate [ 09/Dec/08 ]

Hi Jason,

Thanks for the patches. Have you verified that the checksum calculations work with these changes? The current calculation sums characters which are assumed to be 1-byte. This assumption is made to avoid the need to transcode the message string to bytes for the purpose of calculating the checksum.

Comment by Jason Aubrey [ 09/Dec/08 ]

Hi Steve,

I think there may have been some checksum-related exceptions initially when sending multibyte characters, due to how the buffer was allocated (based on character counts instead of byte counts). However, I didn't modify the checksum code (shown below) since it still works in the same basic way.

<pre>
private int checkSum(String s) {
    int offset = s.lastIndexOf("\00110=");
    int sum = 0;
    for (int i = 0; i < offset; i++) {
        sum += s.charAt(i);
    }
    return (sum + 1) % 256;
}
</pre>

The only difference in behavior is that each character's value can be much larger than a simple ASCII value. For example, "\u65E0\u6548\u7684\u7528" (equivalent to "无效的用") consists of four characters whose values are each up to four hex digits (at most 0xFFFF). If each of these were 0xFFFF, the sum for the group would be 4 * 0xFFFF = 0x3FFFC (262,140 in base 10). Given that the sum is stored as an int, the only risk seems to be overflow past 2,147,483,647, which would take roughly 8,192 such four-character groups (2,147,483,647 / 262,140), i.e. over 32,000 characters, and that assumes every character is 0xFFFF, which it likely would not be. I don't think this is a concern, but if it were, 'sum' could be stored as a larger type. I didn't give the '% 256' logic any particular thought since it works the same way either way.

Comment by amichair [ 09/Jun/14 ]

The above analysis is incorrect, since the checksum should be performed on the encoded bytes, not on the source (UTF-16) characters. By the way, to avoid a negative result if the sum ever overflows, you can use '& 0xFF' instead of '% 256'.
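For readers following along, a minimal sketch of what "checksum over the encoded bytes" means (illustrative only; the class, method, and parameter names are mine, and this is not the exact code that went into the fix):

<pre>
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCheckSumSketch {
    // Sum the bytes that actually go on the wire, not the UTF-16 char values.
    // 'body' is assumed to be the message text up to and including the SOH
    // that precedes the CheckSum(10) field.
    static int byteCheckSum(String body, Charset charset) {
        int sum = 0;
        for (byte b : body.getBytes(charset)) {
            sum += b & 0xFF;   // each byte treated as unsigned
        }
        return sum & 0xFF;     // same as % 256 while the sum is non-negative
    }

    public static void main(String[] args) {
        // "test data" in Chinese: 12 UTF-8 bytes vs. 4 chars, so byte-based and
        // char-based checksums disagree as soon as multibyte characters appear.
        String body = "58=\u6D4B\u9A8C\u6570\u636E\u0001";
        System.out.println(byteCheckSum(body, StandardCharsets.UTF_8));
    }
}
</pre>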

In any case, this is now fixed - thanks for the patches, which helped along the way.

Currently setting a charset via CharsetSupport should work with any charset that is a superset of ASCII, which luckily is most of them.
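For anyone landing here looking for how to actually enable this, the gist is configuring the charset once at startup (a minimal sketch; the class and method names are mine, not a QuickFIX/J API):

<pre>
import java.io.UnsupportedEncodingException;
import quickfix.CharsetSupport;

public final class FixCharsetConfig {
    // Call once during application startup, before any messages are built or parsed.
    public static void enableUtf8() throws UnsupportedEncodingException {
        // UTF-8 is a superset of ASCII, so FIX delimiters and standard tags are unaffected.
        CharsetSupport.setCharset("UTF-8");
    }
}
</pre>

Whether the counterparty accepts the resulting bytes is, of course, a separate question.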

Comment by Kou Jun [ 02/Nov/15 ]

Is there any sample code to send and receive Chinese characters?
It seems it still can't process Chinese characters properly!

Generated at Sat Nov 23 09:03:26 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.