regex for language codes does not accept es-419 #1352

Open
DanielVelaJ opened this issue Sep 17, 2024 · 4 comments

@DanielVelaJ
Describe the bug

When using Chainlit, setting Google Chrome's language to "español (Latinoamérica)" causes the application to fail with a 422 Unprocessable Entity error. The issue arises because Chainlit's language validation pattern does not accept the language code es-419, which corresponds to "español (Latinoamérica)". As a result, the application is unable to load translations and settings, preventing it from functioning properly.

To Reproduce

Steps to reproduce the behavior:

  1. Set Google Chrome Language to "español (Latinoamérica)":

    • Open Google Chrome.
    • Click on the three dots in the upper-right corner and select Configuración (Settings).
    • Scroll down and click on Configuración avanzada (Advanced) to expand advanced settings.
    • Under Idiomas (Languages), click on Idioma (Language).
    • Click Añadir idiomas (Add languages) and select "Español (Latinoamérica)".
    • Click on the three dots next to "Español (Latinoamérica)" and select Mover al principio (Move to the top) to make it the default language.
    • Restart Chrome to apply the changes.
  2. Run a Chainlit Application:

    • Open a terminal and run chainlit hello to start a basic Chainlit application.
  3. Open the Application:

    • In Google Chrome, navigate to http://localhost:8000.
  4. Observe the Error:

    • The application fails to load properly.
    • Open Chrome's developer console (press F12 or right-click and select Inspeccionar (Inspect), then go to the Console tab).
    • Notice multiple 422 Unprocessable Entity errors related to requests to /project/translations and /project/settings with the query parameter language=es-419.

Expected behavior

Chainlit should accept the es-419 language code corresponding to "español (Latinoamérica)" and load the appropriate translations if available. If translations for es-419 are not available, the application should gracefully fall back to a default language (e.g., es for general Spanish or en for English) without causing errors. The application should load normally and be fully functional regardless of the browser's language settings.

Screenshots
[Screenshot: browser console error]

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Google Chrome
  • Version: 128.0.6613.138 (Official Build) (64-bit)

Smartphone (please complete the following information):

Not applicable.

Additional context

  • Error Details:

    The server returns the following error message:

    {
        "detail": [
            {
                "type": "string_pattern_mismatch",
                "loc": [
                    "query",
                    "language"
                ],
                "msg": "String should match pattern '^[a-zA-Z]{2,3}(-[a-zA-Z]{2,3})?(-[a-zA-Z]{2,8})?(-x-[a-zA-Z0-9]{1,8})?$'",
                "input": "es-419",
                "ctx": {
                    "pattern": "^[a-zA-Z]{2,3}(-[a-zA-Z]{2,3})?(-[a-zA-Z]{2,8})?(-x-[a-zA-Z0-9]{1,8})?$"
                }
            }
        ]
    }
  • Cause of the Issue:

    The error occurs because Chainlit's validation regex for the language query parameter does not accept numeric region codes like 419. The regex pattern only allows alphabetic characters in the region and variant parts, so es-419 (which corresponds to "español (Latinoamérica)") is rejected.
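    The rejection can be reproduced outside Chainlit with Python's `re` module. This is a minimal sketch that only copies the pattern from the error message above; the names `CURRENT_PATTERN` and `is_accepted` are illustrative and not taken from Chainlit's code.

```python
import re

# Pattern copied verbatim from the 422 error message.
CURRENT_PATTERN = re.compile(
    r"^[a-zA-Z]{2,3}(-[a-zA-Z]{2,3})?(-[a-zA-Z]{2,8})?(-x-[a-zA-Z0-9]{1,8})?$"
)


def is_accepted(tag: str) -> bool:
    """Return True if the tag passes the current validation pattern."""
    return CURRENT_PATTERN.match(tag) is not None


# "es" and "es-ES" pass; "es-419" fails because every optional group
# after the primary language only allows letters (except after "-x-").
for tag in ("es", "es-ES", "es-419"):
    print(tag, is_accepted(tag))
```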

  • Impact:

    Users with Google Chrome set to "español (Latinoamérica)" cannot load Chainlit applications properly, affecting accessibility for Spanish-speaking users in Latin America and the Caribbean.

  • Workaround:

    Changing Chrome's language setting to general Spanish (es) or Spanish (Spain) (es-ES) allows the application to load correctly. However, this is not an ideal solution for end-users who prefer "español (Latinoamérica)".

  • Suggested Fix:

    • Modify the Validation Regex:

      Update the regex pattern in Chainlit's code to accept numeric region codes. For example:

      ^[a-zA-Z]{2,3}
      (-[a-zA-Z0-9]{2,3})?
      (-[a-zA-Z0-9]{2,8})?
      (-x-[a-zA-Z0-9]{1,8})?$

      This change allows numeric values in the region and variant parts, accommodating language codes like es-419.

    • Graceful Fallback:

      Implement logic to default to a base language (e.g., es) if a specific regional variant is not supported. If es-419 translations are not available, Chainlit should use es.json or en.json without causing errors.
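    The fallback described above could look roughly like the sketch below. It assumes the server knows which translation names exist (for example, the stems of the JSON files) and drops trailing subtags until one matches. The function name `resolve_language` and its parameters are hypothetical; this is not Chainlit's actual implementation.

```python
from typing import Iterable


def resolve_language(
    requested: str, available: Iterable[str], default: str = "en"
) -> str:
    """Pick the closest available language by dropping trailing subtags."""
    # Compare case-insensitively, but return the name exactly as stored.
    by_lower = {name.lower(): name for name in available}
    subtags = requested.split("-")
    while subtags:
        candidate = "-".join(subtags).lower()
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()  # e.g. "es-419" -> "es"
    # Nothing matched; the caller is assumed to ship a default translation.
    return default


print(resolve_language("es-419", ["en", "es"]))  # falls back to "es"
print(resolve_language("fr-CA", ["en", "es"]))   # falls back to the default
```

    With this approach an `es-419.json` file would still take precedence when it exists, because the full tag is tried first.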

  • References:

    • Chainlit Documentation:

      The documentation mentions that translation files are named after the language code and that the language is dynamically set based on the browser's language. However, it does not specify limitations regarding numeric region codes.

    • IETF Language Tags:

      According to the IETF BCP 47 standard, es-419 is a valid language tag: a region subtag may be either a two-letter ISO 3166-1 code or a three-digit UN M.49 code, and 419 is the M.49 code for Latin America and the Caribbean.

  • Additional Notes:

    • Reproducing the Issue: The issue was observed exclusively on Google Chrome with the language set to "español (Latinoamérica)". Other browsers were not tested.

    • Translation Files: Attempting to add an es-419.json translation file did not resolve the issue due to the validation pattern rejecting the es-419 code.


@dokterbob
Collaborator

Thanks for reporting this, it's really something we've missed.

The regexp which validates languages is in https://github.com/Chainlit/chainlit/blob/main/backend/chainlit/server.py#L647, I'd love a PR correcting this!

@dokterbob
Collaborator

@dosu What would you suggest as a regex to extend support for BCP 47?

dosubot bot commented Sep 23, 2024

Hey @dokterbob! I'm here to help you with any bugs, questions, or contributions you have while waiting for a human maintainer. Let's squash those bugs together! 🤖

To extend support for BCP 47 language tags, including numeric region codes like es-419, you can modify the regex pattern to allow numeric values in the region and variant parts. Here is an updated regex pattern that should work:

^[a-zA-Z]{2,3}(-[a-zA-Z0-9]{2,3})?(-[a-zA-Z0-9]{2,8})?(-x-[a-zA-Z0-9]{1,8})?$

This pattern allows for:

  • Primary language subtag: 2 to 3 alphabetic characters.
  • Optional region subtag: 2 to 3 alphanumeric characters.
  • Optional script or variant subtag: 2 to 8 alphanumeric characters.
  • Optional private use subtag: prefixed with -x- and followed by 1 to 8 alphanumeric characters.

This should cover language codes like es-419 and other valid BCP 47 tags.
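One caveat worth checking before a PR: the pattern above accepts es-419, but it still rejects tags that combine a script and a region (such as zh-Hant-TW) and accepts some strings that are not valid region codes (such as es-4x). The sketch below compares it with a stricter pattern that follows the language-script-region-variant order of BCP 47. It is a simplified subset (no extlang, extension, or grandfathered tags), and the names `PROPOSED` and `STRICTER` are illustrative, not part of Chainlit.

```python
import re

# Loose pattern proposed in this thread.
PROPOSED = re.compile(
    r"^[a-zA-Z]{2,3}(-[a-zA-Z0-9]{2,3})?(-[a-zA-Z0-9]{2,8})?(-x-[a-zA-Z0-9]{1,8})?$"
)

# Simplified subset of the BCP 47 "langtag" grammar (an assumption for
# illustration; extlang, extensions and grandfathered tags are omitted).
STRICTER = re.compile(
    r"""
    ^[a-zA-Z]{2,3}                                # primary language
    (-[a-zA-Z]{4})?                               # script, e.g. Hant
    (-(?:[a-zA-Z]{2}|[0-9]{3}))?                  # region: 2 letters or 3 digits
    (-(?:[a-zA-Z0-9]{5,8}|[0-9][a-zA-Z0-9]{3}))*  # variants, e.g. 1901
    (-x(?:-[a-zA-Z0-9]{1,8})+)?$                  # private use
    """,
    re.VERBOSE,
)

for tag in ("es-419", "zh-Hant-TW", "de-CH-1901", "es-4x"):
    print(tag, bool(PROPOSED.match(tag)), bool(STRICTER.match(tag)))
```

Either pattern fixes the es-419 case reported here; the stricter one mainly matters if other browser locales with script subtags need to work as well.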


@erauld

erauld commented Oct 3, 2024

Hi! I am experiencing the same issue. The change @DanielVelaJ proposes solved it for me, but this should be treated as an important fix.
