[99% SOLVED] [FEEDBACK] Locale-sensitive string compare (Unicode Collation Algorithm, UCA)?

By Gary V

07 Nov 2013 04:58 English 11 comments

Hi,

The Tizen string comparison functions ::wcscmp(), Tizen::Base::String::CompareTo() and Tizen::Base::StringComparer::Compare() do not seem to be locale-sensitive; they perform "an ordinal comparison of each Unicode character", according to the Tizen::Base::String documentation. Effectively, accented (non-English) characters sort after all unaccented Latin characters, so for example an unaccented "z" comes BEFORE an accented "a", which is counter-intuitive. I tried changing the device settings for "Date and Time" and "Language and keyboard", without any effect.

Does anyone know how to compare strings in a locale-specific manner, as defined by the Unicode Collation Algorithm (UCA)? Did I miss something in the docs?

Thanks,

Gary

Edited by: Gary V on 13 Nov, 2013

Responses

11 comments

Gary V

07 Nov 2013 08:17

FYI, what seems to somewhat work is the following:

// WARNING: "C" - minimal "C" locale (the same as locale::classic) does NOT work
std::locale loc(/* the environment's default locale */ "");

const std::collate<wchar_t> & col = std::use_facet<std::collate<wchar_t> >(loc);

int result = col.compare(...);

However, there is no way to convert a particular Tizen::Locales::Locale to the corresponding std::locale, so the code has to rely on an unknown "environment's default locale", instead of creating the chosen std::locale explicitly. Not very reliable.
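For reference, the std::collate approach above can be wrapped in a small helper. This is a minimal sketch in standard C++ only; CompareCollated is a hypothetical name, not a Tizen API, and the caller decides which std::locale to pass in.

```cpp
#include <locale>
#include <string>

// Hypothetical helper (not part of Tizen): compare two wide strings using
// the collation rules of the given std::locale.
// Returns -1, 0 or 1, as std::collate<wchar_t>::compare() does.
int CompareCollated(const std::locale& loc, const std::wstring& lhs, const std::wstring& rhs)
{
    const std::collate<wchar_t>& col = std::use_facet<std::collate<wchar_t> >(loc);

    // compare() takes two [begin, end) character ranges
    return col.compare(lhs.data(), lhs.data() + lhs.size(),
                       rhs.data(), rhs.data() + rhs.size());
}
```

Passing std::locale("") selects the environment's default locale. Note that the std::locale constructor throws std::runtime_error when the requested name is not supported, so real code would probably want to catch that.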

Perhaps Tizen should support locale-sensitive string compare explicitly, for example as:

int Tizen::Base::String::CompareTo(const String & str, const Tizen::Locales::Locale & locale) const;

Gary


Yoonsoo Kim

09 Nov 2013 09:38

Thanks for raising this issue.

As you noticed, the String class unfortunately does not support locale-dependent comparison.

For now you can use the C++ locale facilities or the wcscoll() or wcscoll_l() functions. Those APIs remain portable because the standard C/C++ library is part of Tizen compliance.
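A minimal sketch of the wcscoll() route, assuming only the standard C library. Both function names are hypothetical. wcscoll() always uses the process-wide LC_COLLATE setting, which is the "C" locale until setlocale() is called.

```cpp
#include <clocale>
#include <cwchar>

// Hypothetical helper: switch the process-wide collation rules to the
// environment's locale. Returns false if the C library rejects that locale.
bool UseEnvironmentCollation()
{
    return std::setlocale(LC_COLLATE, "") != NULL;
}

// Hypothetical helper: compare two wide strings under the current LC_COLLATE.
// Returns a negative value, zero, or a positive value, like wcscmp().
int CompareCurrentLocale(const wchar_t* lhs, const wchar_t* rhs)
{
    return std::wcscoll(lhs, rhs);
}
```

Calling UseEnvironmentCollation() once at startup and CompareCurrentLocale() afterwards would be the typical usage. Because setlocale() changes global state, this does not suit code that needs several locales at the same time.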

It would be good to add locale-dependent comparison APIs in Tizen 3.0, but the API may differ from what you expect because there may be a circular dependency problem between Tizen::Locales::Locale and Tizen::Base::String.


Gary V Yoonsoo Kim

09 Nov 2013 10:18

Thanks for the reply.

The problem is that there is no way to specify the desired locale explicitly, for instance when a mobile device is set to the U.S. locale but the user wants to sort using the French locale. Also, the documentation for std::collate and ::wcscoll() does not guarantee that the locale used for the comparison is actually the one the mobile device has been configured with. ::wcscoll() is only defined as "interpreted as appropriate to the LC_COLLATE category of the current locale".

I meant the Tizen::Base::String API in the following manner:

LocaleManager* pLocaleManager = new (std::nothrow) LocaleManager();
pLocaleManager->Construct();
    
Locale systemLocale = pLocaleManager->GetSystemLocale();

String str(L"contentA");
str.Compare(L"contentB", /* use device configuration explicitly */ systemLocale);

Gary


Yoonsoo Kim

09 Nov 2013 19:23

I understand what you are asking for, and wcscoll_l() may serve your use case. It takes an explicit locale_t argument.
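A minimal sketch of that idea, assuming a POSIX.1-2008 C library such as glibc (some other platforms need an extra header such as <xlocale.h>). CompareInLocale is a hypothetical helper, and the locale name is whatever string the C library on the device accepts.

```cpp
#include <locale.h>   // newlocale(), freelocale(), LC_ALL_MASK (POSIX.1-2008)
#include <wchar.h>    // wcscoll_l()

// Hypothetical helper: compare lhs and rhs under the named locale and store
// the outcome (negative, zero or positive) in *pResult.
// Returns false if the C library does not recognise localeName.
bool CompareInLocale(const char* localeName, const wchar_t* lhs, const wchar_t* rhs, int* pResult)
{
    locale_t loc = ::newlocale(LC_ALL_MASK, localeName, (locale_t) 0);
    if (!loc)
        return false;

    *pResult = ::wcscoll_l(lhs, rhs, loc);

    // the locale object is owned by the caller of newlocale() and must be released
    ::freelocale(loc);
    return true;
}
```

Unlike wcscoll(), this does not touch the process-wide locale, so different calls can use different locales. Creating and freeing the locale_t on every call is wasteful; code that sorts many strings would more likely create it once and reuse it.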


Gary V Yoonsoo Kim

11 Nov 2013 03:06

Hi,

Sorry, I wasn't clear. I meant to say that wcscoll_l(..., ..., locale_t locale) has the same problem as std::collate - there is no Tizen API or a documented (in other words, guaranteed) way to convert a Tizen::Locales::Locale (for example, from LocaleManager::GetSystemLocale() representing the user-chosen device locale) into locale_t, through newlocale(LC_ALL_MASK, <locale-name>, NULL).

The newlocale() documentation mentions that "the locale string is typically the name of one of the directories in /usr/share/locale". However, when enumerating LocaleManager::GetAvailableLocalesN() on a Tizen emulator device, newlocale() works only for "hi_IN" and "hy_AM"; otherwise it returns NULL. For example, newlocale() fails for "en_US" and "fr_FR", even though /usr/share/locale/en_US and /usr/share/locale/fr_FR actually exist. Three-plus-two locale code strings ("eng_US", "fra_FR", ...) from Tizen::Locales::Locale::GetLocaleCodeString() do not work. Finally, two-letter language codes alone ("en", "fr", ...) do not work, either.
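The trial and error described above can be automated. A minimal sketch, assuming a POSIX.1-2008 C library; FirstUsableLocaleName is a hypothetical helper, and the candidate spellings passed in are the caller's own guesses.

```cpp
#include <cstddef>
#include <locale.h>   // newlocale(), freelocale(), LC_ALL_MASK (POSIX.1-2008)
#include <string>
#include <vector>

// Hypothetical helper: return the first candidate name that newlocale()
// accepts, or an empty string if none of them is accepted.
std::string FirstUsableLocaleName(const std::vector<std::string>& candidates)
{
    for (std::size_t i = 0; i < candidates.size(); ++i)
    {
        locale_t loc = ::newlocale(LC_ALL_MASK, candidates[i].c_str(), (locale_t) 0);
        if (loc)
        {
            // only probing here, so release the locale object straight away
            ::freelocale(loc);
            return candidates[i];
        }
    }

    return std::string();
}
```

A caller would push several spellings of the same locale into the vector and use whichever one comes back. An empty result means the C library rejected every spelling.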

Effectively, there seems to be no way to use wcscoll_l() just like std::collate. Suggestions?

Gary


Yoonsoo Kim

12 Nov 2013 22:28

You can get the list of locales supported by the system by running the "locale -a" command in "sdb shell", or by right-clicking the emulator screen and selecting the shell menu.


Gary V Yoonsoo Kim

13 Nov 2013 04:14

Thanks for the tip.

FYI, two problems:

1.) Tizen::Locales::Locale::LanguageCodeToTwoLetterLanguageCodeString() differentiates between Cyrillic and Latin locales ("az-cyrl", "az-latn", ...), but newlocale() does NOT, with a single exception of "sr_RS.utf8@latin". Not sure how this should be handled.

2.) All newlocale() locale names are designated as UTF-8, with the exception of the following: "fa_IR", "hi_IN", "hy_AM", "sr_RS", "sr_RS@latin", "ur_PK" and "vi_VN". Not sure how this should be handled.

Sample:

//
Tizen::Locales::LocaleManager* pLocaleManager = new (std::nothrow) Tizen::Locales::LocaleManager();
pLocaleManager->Construct();

//
Tizen::Locales::Locale systemLocale = pLocaleManager->GetSystemLocale();
delete pLocaleManager, pLocaleManager = null;

//
Tizen::Base::String strLanguageName =
    Tizen::Locales::Locale::LanguageCodeToTwoLetterLanguageCodeString(systemLocale.GetLanguageCode());

// WARNING: LanguageCodeToTwoLetterLanguageCodeString() differentiates between Cyrillic and Latin locales
// ("az-cyrl", "az-latn", ...), but newlocale() does NOT (with a single exception of "sr_RS.utf8@latin")

if (strLanguageName.GetLength() > 2)
	strLanguageName.SubString(0, 2, /* onto self */ strLanguageName);

// locale name (see "locale -a" on a Tizen device): "en_US.utf8", "en_GB.utf8", "fr_FR.utf8", ...
//
// WARNING: [ignored] non-utf8 exceptions:
// "C", "fa_IR", "hi_IN", "hy_AM", "POSIX", "sr_RS", "sr_RS@latin", "ur_PK", "vi_VN"

Tizen::Base::String strLocaleName =
	strLanguageName +
	L"_" +
	Tizen::Locales::Locale::CountryCodeToString(systemLocale.GetCountryCode()) +
	L".utf8";

Tizen::Base::ByteBuffer* pBuf = Tizen::Base::Utility::StringUtil::StringToUtf8N(strLocaleName);

// it's actually healthy to be paranoid about locale mapping

locale_t aLocale = (locale_t) NULL;

try
{
	aLocale = ::newlocale(LC_ALL_MASK, reinterpret_cast<const char*>(pBuf->GetPointer()), (locale_t) NULL);
}
catch(...)
{
	// ...
}

delete pBuf, pBuf = null;

if (!aLocale)
{
	// ...
}

// FYI: wchar.h
int iCompareResult = ::wcscoll_l(<string1>, <string2>, aLocale);

// release the locale object once it is no longer needed
::freelocale(aLocale);

I think we all agree that the above is a bit too convoluted and Tizen should support a simple:

int Tizen::Base::String::CompareTo(const String & str, const /* Tizen::Locales:: */ Locale & loc) const;

Gary


Chintan Gandhi

13 Nov 2013 03:55

Hi Gary,

Thanks for bringing this issue to our notice. Kindly bear with us. We will get back to you soon.

Thanks again.


Yoonsoo Kim

13 Nov 2013 06:17

I agree that a locale-aware comparison feature is necessary, but I don't agree with your API signature for it. As I mentioned before, your signature could create a circular dependency between Tizen::Locales and Tizen::Base. A circular dependency can be a maintenance headache and should be avoided at all costs. Currently there is a one-way dependency from Tizen::Locales to Tizen::Base. Tizen::Base is the most fundamental module of the Tizen Native APIs. It is fine for Tizen::Locales to depend on the lower layer, Tizen::Base, but a dependency in the opposite direction, or in both directions, is pretty bad. It may be better to put the related APIs in the Tizen::Locales namespace. Thanks!


Gary V Yoonsoo Kim

13 Nov 2013 06:42

I see your point.

Of course, anything in the form of, say:

// mirror Tizen::Base::Utility::StringUtil

static int Tizen::Locales::Utility::LocaleUtil::Compare(const String & strLHS, const String & strRHS, const Tizen::Locales::Locale & locale);

... would be just as good.

It would also be useful to have an API to reliably convert Tizen::Locales::Locale into locale_t, for those who use std::string or wchar_t* and actually prefer to stay with ::wcscoll_l() for efficiency (no need to create Tizen::Base::String).
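Until such an API exists, the mapping has to be done by hand. A minimal sketch of the name-building step in standard C++, for code that already holds the language and country codes as std::string. MakePosixLocaleName is a hypothetical helper, and the "<language>_<COUNTRY>.utf8" pattern is an assumption based on the "locale -a" output discussed in this thread; it ignores the non-utf8 exceptions and the script variants.

```cpp
#include <string>

// Hypothetical helper: build a locale name such as "fr_FR.utf8" that can be
// handed to newlocale().
// ASSUMPTION: the system names its locales "<language>_<COUNTRY>.utf8".
std::string MakePosixLocaleName(const std::string& language, const std::string& country)
{
    // Drop script suffixes such as "-latn" or "-cyrl": keep the first two letters only
    const std::string twoLetterLanguage = language.substr(0, 2);

    return twoLetterLanguage + "_" + country + ".utf8";
}
```

The result would then be passed to ::newlocale(LC_ALL_MASK, name.c_str(), (locale_t) 0), whose return value still has to be checked, because a well-formed name is no guarantee that the locale is installed.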

Thanks for the help!

Gary


Yoonsoo Kim Gary V

13 Nov 2013 06:46

Thanks for another TODO item. :-)
