fromUnicode() documentation bug

seasoned_geek · Post by **seasoned_geek** » Thu Apr 01 2021 1:50 pm

All,

As part of trying to track down a real answer to this issue: https://forum.copperspice.com/viewtopic.php?f=11&t=1755

Unraveling this

Code: Select all

		for (unsigned int i = 0; i < commitStrLen;) {
			const unsigned int ucWidth = commitStr.at(i).isHighSurrogate() ? 2 : 1;
			const QString oneCharUTF16 = commitStr.mid(i, ucWidth);
			const QByteArray oneChar = sqt->BytesForDocument(oneCharUTF16);

			sqt->InsertCharacter(std::string_view(oneChar.data(), oneChar.length()), EditModel::CharacterSource::directInput);
			i += ucWidth;
		}

ultimately tracks back to a method that returns either toUtf8() of a string _or_ the QByteArray from QTextCodec::fromUnicode()

Under Qt
=====
Detailed Description. QString stores a string of 16-bit QChars, where each QChar corresponds to one UTF-16 code unit. (Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.).
=====

So, whoever did this

Code: Select all

QByteArray ScintillaQt::BytesForDocument(const QString &text) const
{
	if (IsUnicodeMode()) {
		return text.toUtf8();
	} else {
		QTextCodec *codec = QTextCodec::codecForName(
				CharacterSetID(CharacterSetOfDocument()));
		return codec->fromUnicode(text);
	}
}

should have really done text.toUtf16() so both sides of the if match.

No problem. Hapless coder goes looking to see what fromUnicode() returns in its byte array under CopperSpice

Converts str from Unicode to the encoding of this codec and returns the result in a QByteArray. This method updates the state.

Is the content of this QChar32 so everything is 4-bytes wide and no need for high surrogate logic?
Is this still returning a QByteArray of UTF-16?

There is a similar issue for QTextCodec::convertFromUnicode

Basically, every place in the documentation where QByteArray is returned from something that could have been some form of a string the documentation needs to spell out what form the content is really in.

There is no safe assumption here.

ansel · Post by **ansel** » Fri Apr 02 2021 1:24 am

The intent of the QTextCodec class is to convert between a QString and some specified text encoding. For Qt, the QString is in UCS-2, whereas for CopperSpice the QString is in UTF-8. In either case, the QByteArray that is returned by QTextCodec::fromUnicode will be in whatever encoding was requested when calling QTextCodec::codecForName(). The purpose of the QTextCodec class has not changed although we did redesign the implementation.

seasoned_geek · Post by **seasoned_geek** » Fri Apr 09 2021 5:54 pm

ansel wrote: ↑Fri Apr 02 2021 1:24 am The intent of the QTextCodec class is to convert between a QString and some specified text encoding. For Qt, the QString is in UCS-2, whereas for CopperSpice the QString is in UTF-8. In either case, the QByteArray that is returned by QTextCodec::fromUnicode will be in whatever encoding was requested when calling QTextCodec::codecForName(). The purpose of the QTextCodec class has not changed although we did redesign the implementation.

Basically, every place in the documentation where QByteArray is returned from something that could have been some form of a string the documentation needs to spell out what form the content is really in.

Answers containing the word "whatever" breed instability.

What really messes things up for this code and that answer is the loss of isHighSurrogate()

At least I cannot find it anywhere in the CopperSpice documentation.

This code, for whatever reason, relied on being able to crawl byte by byte through the raw values. This meant some of the QTextCodec instances would/could return UTF-16 and they used isHighSurrogate() to determine that and automagically grab next byte. Much of this logic has to do with the drawing of Ime Indicators.

barbara · Post by **barbara** » Sun Apr 11 2021 7:54 am

. . . every place in the documentation where QByteArray is returned from something that could have been some form of a string the documentation needs to spell out what form the content is really in.

What really messes things up for this code and that answer is the loss of isHighSurrogate()

This code, for whatever reason, relied on being able to crawl byte by byte through the raw values. This meant some of the QTextCodec instances would/could return UTF-16 and they used isHighSurrogate() . . .

The first item to mention is that in CopperSpice QString is a typedef for QString8, so I will use QString for simplicity The contents of a QString object is encoded in UTF-8. There is no concept of a surrogate in UTF-8, which means there are no surrogate pairs to get mixed up. As an example, it takes four 8-bit storage units to represent a "musical note" character.

In UTF-16 a surrogate pair is required to encode this same character using two 16-bit storage units. This is part of the definition from the Unicode standard.

Keep in mind, the QString class in Qt does not actually support UTF-16. Internally it behaves more like UCS-2 encoding which was standardized before surrogate pairs were necessary. There are characters which simply can not be represented in UCS-2. These two encodings are very similar and for majority of code points the distinction does not matter, until you get past U+FFFF.

If QString in Qt were really UTF-16 then the "musical note" character would be handled internally and users would not be responsible for "gluing" the surrogates together. This is why they have the methods for isHighSurrogate() and isLowSurrogate() in QChar.

CopperSpice does not need to test for surrogate pairs in QSting8 since they do not exist. For QString16 we handle the surrogate pairs internally. We want to be clear, we removed isHighSurrogate() since it is never needed in CopperSpice.

The entire purpose for QTextCodec is to convert between (1) data stored in a QString and (2) data in a QByteArray, using the encoding specified in the parameters for QTextCodec.

For anyone reading this information, strings can be very complicated when you look under the hood. This is why we spent several months developing QString8 and QString16 and then refactoring the entire code base. Users of CS can trust that surrogates are handled correctly. One of the CS team members spent countless hours writing Catch tests for these classes.

Any code you are converting to CS which dealt with surrogates needs to be simplified. Usually this involves just removing all the code with handles surrogate pair manipulation.

Barbara

CopperSpice Forum

fromUnicode() documentation bug

fromUnicode() documentation bug

Re: fromUnicode() documentation bug

Re: fromUnicode() documentation bug

Re: fromUnicode() documentation bug