It would be nice to add a romanized version of location names for locations whose names contain non-Roman characters (CJK, Arabic, Hebrew, Cyrillic, etc.). The goal is not to translate names, since many of them are proper nouns and are not suited to translation.
e.g. 아산시청 -> asan-si cheong
The functionality would be additive only and would not replace the original name, so some consideration of how the romanized name is stored in the DB is also needed.
In the ingestion process, I propose this be a two-step process:
1. Identify if a name has non-Roman characters
2. Romanize the characters
For the first step, as long as the names come in as Unicode, it seems we can scan each name for matches against the character ranges of the different scripts (e.g. https://stackoverflow.com/a/50434862) and then identify the general character set.
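A minimal sketch of that first step, assuming Python and only the standard library. Instead of hard-coding code point ranges it looks at the prefix of each character's Unicode name, which is a swapped-in shortcut rather than the range-based approach from the linked answer. The names `SCRIPT_PREFIXES`, `detect_scripts`, and `needs_romanization` are hypothetical, and the prefix list is illustrative, not complete.

```python
import unicodedata

# Hypothetical mapping from Unicode character-name prefixes to script labels.
# Illustrative only: a real list would need to cover more scripts.
SCRIPT_PREFIXES = {
    "HANGUL": "hangul",
    "CJK": "han",
    "HIRAGANA": "kana",
    "KATAKANA": "kana",
    "ARABIC": "arabic",
    "HEBREW": "hebrew",
    "CYRILLIC": "cyrillic",
}


def detect_scripts(name):
    """Return the set of non-Roman script labels found in `name`."""
    scripts = set()
    for ch in name:
        # Skip digits, punctuation, whitespace, and combining marks.
        if not ch.isalpha():
            continue
        uname = unicodedata.name(ch, "")
        for prefix, label in SCRIPT_PREFIXES.items():
            if uname.startswith(prefix):
                scripts.add(label)
                break
    return scripts


def needs_romanization(name):
    """True if the name contains at least one character from a listed script."""
    return bool(detect_scripts(name))
```

Latin letters with diacritics (e.g. "São Paulo") have names starting with `LATIN`, so they are not flagged. A name that mixes scripts returns more than one label, which the second step would need to handle.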
For the romanization, using individual libraries for each language or character group may be the best approach; I can't find a one-size-fits-all library. One issue to consider: where several languages share a character set but pronounce the characters differently, we may need to identify the language, not just the character set. For example, Persian uses Arabic characters but may have different romanized output than Arabic, and the same applies to the many languages that use Cyrillic (Russian, Ukrainian, Bulgarian, Mongolian).
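To illustrate why the language matters and not just the character set, here is a rough sketch of a romanizer lookup keyed by script, with a per-language override inside the Cyrillic one. Everything here is hypothetical: the tables are tiny toy subsets (not a full transliteration standard), the function names are made up, and in practice each entry would likely wrap a dedicated library. Output is lowercased for simplicity.

```python
# Toy subset of Cyrillic letters that romanize the same way in all three languages.
CYRILLIC_BASE = {"а": "a", "о": "o", "р": "r", "к": "k", "м": "m", "т": "t"}

# Letters whose common romanization differs by language (illustrative subset).
CYRILLIC_OVERRIDES = {
    "ru": {"г": "g", "и": "i", "щ": "shch"},
    "uk": {"г": "h", "и": "y", "щ": "shch"},
    "bg": {"г": "g", "и": "i", "щ": "sht"},
}


def romanize_cyrillic(name, lang):
    """Romanize with the base table plus overrides for `lang`.

    Characters missing from the toy tables are passed through unchanged.
    """
    table = {**CYRILLIC_BASE, **CYRILLIC_OVERRIDES.get(lang, {})}
    return "".join(table.get(ch, ch) for ch in name.lower())


# Hypothetical registry: one romanizer per script, each taking (name, lang).
ROMANIZERS = {
    "cyrillic": romanize_cyrillic,
}


def romanize(name, script, lang=None):
    """Return the romanized name, or None if no romanizer is registered."""
    romanizer = ROMANIZERS.get(script)
    if romanizer is None:
        return None
    return romanizer(name, lang)
```

With these tables, `romanize("гора", "cyrillic", "ru")` gives `"gora"` while the same string with `"uk"` gives `"hora"`. Returning `None` for unsupported scripts keeps the feature additive: the original name is stored either way and the romanized field is simply left empty.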
Any thoughts? @caparker @majesticio