[Feature Request] Adding function(s) to return the printable width of a String or Stringlike types? #3785

Open
1 task done
thatstoasty opened this issue Nov 19, 2024 · 4 comments
Labels
enhancement New feature or request mojo-repo Tag all issues with this label needs-discussion Need discussion in order to move forward

Comments

@thatstoasty
Contributor

Review Mojo's priorities

What is your request?

For terminal-based applications, you usually need to know the printable width of characters. Could this possibly be added to the stdlib for String, StringSlice, etc.? In the few languages I've looked at, these unicode-width packages are usually implemented outside the standard library, so I wanted to hear thoughts from the team/contributors!

Some examples:
Rust: https://github.com/unicode-rs/unicode-width/tree/master
Go: https://github.com/mattn/go-runewidth/tree/master
My simple port of go-runewidth: https://github.com/thatstoasty/gojo/blob/main/src/gojo/unicode/utf8/width.mojo

What is your motivation for this change?

It would be nice to have, but I understand keeping the stdlib lean as well.

Any other details?

No response

@thatstoasty thatstoasty added enhancement New feature or request mojo-repo Tag all issues with this label labels Nov 19, 2024
@martinvuyk
Contributor

Hi, if you mean the number of UTF-8 bytes that the character needs, all 3 stringlike types have a byte_length() function since they are all UTF-8 encoded. String.__len__ is supposed to return the Unicode length in the future (it's a breaking change that I'm trying to push ASAP); StringSlice.__len__ already returns the Unicode length.

PS: if you're working a lot with strings, you should take a look at all the helpers in /utils/string_slice.mojo, many are currently private but I would like to make them public over time since I know there are a lot of people like yourself who also want to do things with strings without reimplementing everything we already use 😄 .

@thatstoasty
Contributor Author

thatstoasty commented Nov 20, 2024

Hey @martinvuyk thanks for the response! I don't mean the byte length, but the width of the character once printed. Some characters have a printable width of 0, and others like emojis and East Asian characters can have a width of 2.

For example: 🔥🔥🔥🔥 has a printable width of 8, with a length of 4 and a byte length of 16.

Please correct me if I'm wrong on the byte length!
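
To make the three measurements concrete, here is a minimal sketch in Python (not Mojo), assuming the standard unicodedata module is a good enough approximation of terminal behavior. The helper name display_width is hypothetical and is not part of any stdlib discussed in this thread; real terminals and libraries such as go-runewidth handle more cases (control characters, ambiguous-width characters, emoji sequences).

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns `text` occupies.

    Hypothetical helper for illustration only: it treats combining marks
    and format characters as width 0, East Asian Wide/Fullwidth characters
    as width 2, and everything else as width 1.
    """
    width = 0
    for ch in text:
        # Nonspacing marks, enclosing marks, and format characters
        # (for example zero-width joiners) take no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # "W" (Wide) and "F" (Fullwidth) characters take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


fire = "🔥" * 4
print(len(fire))                  # number of code points: 4
print(len(fire.encode("utf-8")))  # UTF-8 byte length: 16
print(display_width(fire))        # approximate printable width: 8
```

Each 🔥 is one code point encoded as 4 UTF-8 bytes, so the byte length of 16 above checks out.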

@martinvuyk
Contributor

Oh ok, now I understand why the code forks for East Asian characters 😄. I think things like this and grapheme clusters might or might not be worth adding to the stdlib. From what I've read, it seems like this is only used for terminal printing, so IMO this has a lower chance than grapheme clusters.

Another thing that I very often think about is memory cost, especially global lookup tables and the cost of having them in scope. I'm not sure how well Mojo prunes away unused code for some functions after compilation, but my main concern is that the more load we add to the memory requirements of using the Mojo stdlib, the harder it will be for Mojo to run on microcontrollers. My personal hope is to be able to stop writing C (or C-flavored C++ when using Arduino), which I do because everything else is just too heavy to run on ~300 KiB DRAM (~4 KiB stack) and 1 or 2 RISC CPU cores running at ~160 MHz. (This worry would be meaningless if Mojo prunes everything perfectly at compile time 🤷‍♂️.)

PS: I think this is a great case for a community library that provides tools for developing CLI libraries (exactly what your prism repo is ;) ). argparse is also a Python module in the stdlib which might not get incorporated into the Mojo stdlib (or only as a very bare-bones version), and if a community package provides printing and parsing functionality, it would reduce the workload for the stdlib team, which is already struggling as it is 😞.

@thatstoasty
Contributor Author

Understandable! I figured other languages kept it external for similar reasons. It's a specific domain, so perhaps it's better off living in a blessed library in the future.

Maybe it could piggyback on the existing UTF-8 validation logic in the utils module, but I'm no Unicode expert, so that's a task for future me or someone else more knowledgeable haha.

@JoeLoser JoeLoser added the needs-discussion Need discussion in order to move forward label Nov 21, 2024 — with Linear
3 participants