Hello, everyone. I built Unicorn: an embeddable MISRA C:2012 implementation of essential Unicode Algorithms.
Unicorn is designed to be fully customizable: you can select which Unicode algorithms and character properties are included or excluded from compilation. You can also exclude Unicode character blocks wholesale for scripts your application does not support. It's perfect for resource-constrained devices like microcontrollers and IoT devices.
About me: I quit my Big Corp job a few years back to pursue my passion for software development and this is one of my first commercial releases.
On https://railgunlabs.com/unicorn/manual/misra-compliance/, I think you will want to fix a typo in
    1.2 Required Compliant (verified by compiling with Clang's -pdentic flag)
                                                               ^^^^^^^^
Or am I too pedantic?
This is the most ironic typo I've ever made. Thanks for the catch. I've corrected it.
A couple more suggestions.
List the platforms (& compilers) that you've tested on.
Compare (pros/cons) against other Unicode libs (like others have done elsewhere in this thread, i.e. https://news.ycombinator.com/item?id=42424637 and https://news.ycombinator.com/item?id=42424638)
Thanks for the suggestions. I think a comparison table would be useful, but I want to make sure I do it right since I'd be comparing my work to someone else's.
As for the compilers, I’ve tested the library with GCC, Clang, and MSVC, and with the -pedantic flag like the GP mentioned. The library should build with any standard-compliant C99 compiler.
This is commercial only. Free and small is my safeclib, which does about half of it. ICU is not usable on small devices, and also pretty slow. It's much faster to use precomputed tables per algorithm, such as here or in safeclib. libunistring is also extremely slow. This was tried for grep and failed.
> This is commercial only.
You can use Unicorn for non-commercial use [1], but yes, for commercial use you need to buy a license.
> It's much faster to use precomputed tables per algorithm
You're absolutely right about using precomputed tables per algorithm. That is the secret to the library's speed.
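To illustrate the general idea of a precomputed table (a minimal sketch only; this is not the table layout Unicorn or safeclib actually uses, and the names `cp_range`, `white_space_ranges`, and `is_white_space` are made up for the example), a property lookup can be a binary search over a sorted array of code point ranges that is generated ahead of time and stored as constant data:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
struct cp_range {
    uint32_t first;
    uint32_t last;
};

/* Sorted, non-overlapping ranges for the Unicode White_Space property.
 * In a real library this array would be generated from the Unicode
 * Character Database rather than written by hand. */
static const struct cp_range white_space_ranges[] = {
    {0x0009, 0x000D}, {0x0020, 0x0020}, {0x0085, 0x0085}, {0x00A0, 0x00A0},
    {0x1680, 0x1680}, {0x2000, 0x200A}, {0x2028, 0x2029}, {0x202F, 0x202F},
    {0x205F, 0x205F}, {0x3000, 0x3000},
};

/* Binary search: O(log n) comparisons, no allocation, read-only data. */
bool is_white_space(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof white_space_ranges / sizeof white_space_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < white_space_ranges[mid].first) {
            hi = mid;
        } else if (cp > white_space_ranges[mid].last) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}
```

Because the table is `const`, it can live in flash/ROM on a microcontroller, and a table for a property the application never queries can simply be left out of the build.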
> Free and small is my safeclib, which does about half of it.
I like safeclib! It's nice to hear from the author. It's worth distinguishing that safeclib is a safer string library whereas Unicorn is a Unicode algorithms library, not a string library.
[1] https://github.com/railgunlabs/unicorn/blob/master/LICENSE
Well, a string is Unicode nowadays. And for sure not just a zero-terminated blob. That would be a buffer. Only the Linux kernel still holds this invalid view.
So every string library needs at least a compare function to find strings, with all the variants of same graphemes. Which leads us to NFC normalization for a start. Upcase tables and wordlength tables are also needed.
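A small sketch of why a plain byte comparison is not enough (the names `precomposed`, `decomposed`, and `bytes_equal` are made up for this example): the same visible "é" can be encoded as one precomposed code point or as a base letter plus a combining mark, and `strcmp` treats the two as different strings. Normalizing both sides (for example to NFC) before comparing is what makes them compare equal; that step is not implemented here.

```c
#include <stdbool.h>
#include <string.h>

/* "é" as a single precomposed code point, U+00E9 (UTF-8: C3 A9). */
static const char precomposed[] = "\xC3\xA9";

/* "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (UTF-8: 65 CC 81). */
static const char decomposed[] = "e\xCC\x81";

/* Byte-wise equality: knows nothing about canonical equivalence. */
bool bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Calling `bytes_equal(precomposed, decomposed)` returns `false` even though both strings render as the same grapheme.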
> This is commercial only. Free and small is my safeclib
Is it me or does this feel a bit weird? It seems like you're using the comments section here to self-advertise for exposure.
I read it like — "businesses can't use this without paying the OP, however, if you're a business you can get 50% of the way there by using _my_ library, and you don't even have to pay me!". It comes off incredibly rude to try to undercut the OP like this.
Self promotion in comments is perfectly fine on HN, as long as the comment is on topic and informative---both of which are true here.
In general I'd agree with that, but IMO the benefit of having a FOSS alternative to a proprietary product overrides it in this case.
See this is where I and the FSF/OSI diverge, because
> the right to use, copy, modify, merge, publish and distribute the Software [as long as you're not selling it or derivatives of the Software]
seems to line up with exactly what the folks involved in Free Software originally wanted — the ability to fix, patch, debug software that runs on their systems. I also think it's incredibly important to have non-commercial clauses given that the vast majority of technical infrastructure in the modern world is built on FOSS, all while the companies give nothing back and developers of FOSS starve.
If Valve can dump hundreds of developers into FOSS and within, what, 7 years? bring Linux almost to parity with and performance of Windows for gaming, imagine what would happen if FOSS developers were actually given funding!
This is not a comment about open/closed-source software and/or licensing models.
Projects like this never fail to impress me vis-a-vis source obfuscation. The 'generate.pyz' is an interesting twist on the usual practice.
> # You may not reverse engineer, decompile, disassemble, or otherwise attempt
> # to derive the source code or underlying structure of this script
This prohibition is void in certain relevant jurisdictions, for any publicly available product.
There is not much to show if I can’t read the source code.
You can download a prebuilt amalgamation from GitHub to see the amalgamated C code [1]. The GitHub repo contains the code that generates the amalgamation.
This looks interesting. Most embedded projects I know use ICU/libicu for their Unicode needs. As a potential customer, I would like to know how it compares against ICU for performance and code size. Why should I switch?
> I would like to know how does it compare against ICU
ICU is a large library, typically around 40 MB depending on the platform, whereas Unicorn, with all features enabled, is only about 600 KB.
ICU has a broader scope: it's not just a Unicode library, but also an internationalization library. Unicorn, on the other hand, is specifically focused on Unicode algorithms.
ICU wasn't designed to be customized. It's also non-MISRA compliant and written in C++11. In contrast, Unicorn is written in C99, fully customizable, MISRA compliant, and only requires a few features from libc [1]. It's far more portable.
[1] https://github.com/railgunlabs/unicorn/?tab=readme-ov-file#u...
My comment is not directly related to the particular project which is impressive, but more to its presentation. If you go to the author's website, you will find neat to-the-point manuals and other useful information. This is what I call the real Web 3.0. Simple and to the point. Also the main company page is humorous in a good way, about the mad scientist etc.
I appreciate the kind words. When I was designing the website, I wanted to inject my personality into it. I wasn't sure how well the "playfulness" would go over, but I'm glad you enjoyed it.
Since MISRA is targeted at automotive, as a software dev in the automotive space I would suggest adding a note that this is able to run on POSIX-compliant OSes like QNX :)
If you would like to chat, hit me up.
I don’t get the whole MISRA requirement that functions should only have one exit point. Honestly, nobody has been able to explain why this is important, other than it’s a historical anomaly inherited from FORTRAN. (Which was actually for a good reason)
That's one of many rules in MISRA that originate from antiquated "best practices" from the dark ages that don't actually improve safety. We have it today by way of IEC 61508, which gets it from a book on structured programming called Structured Design. That book didn't recommend banning multiple exit points, but it recommended minimizing them to simplify the control flow graph and said code should minimize the distance between black boxes (bits of code that do something without leaky abstractions and have only one return statement). The IEC authors and MISRA thought the logical extension of that was to make everything have one exit point.
I recall reading a study showing that MISRA actually tended to _increase_ the average number of bugs in software projects.
This is an old rule from when structured programming was introduced. The prior state of affairs was that code would jump via gotos between functions to different labels within those functions (labels were global). The requirement that every function should have only a single entry point and a single exit point seemed like a good rule to establish sanity.
MISRA C states the following rationale:
“A single point of exit is required by IEC 61508 and ISO 26262 as part of the requirements for a modular approach.
Early returns may lead to the unintentional omission of function termination code.
If a function has exit points interspersed with statements that produce persistent side effects, it is not easy to determine which side effects will occur when the function is executed.”
Note that the MISRA C rule is merely advisory, meaning it is a recommendation and not a hard requirement (i.e. it’s a “should” and not a “shall”).
Having been only lightly exposed to MISRA, my impression is that in MISRA the "Required" label doesn't mean "You must always do this"; rather it means, "You must do this or document why it's necessary and safe not to do it".
It's a bit like an operating system write-protecting pages. Sometimes you write-protect pages that the process really shouldn't write to, like shared libraries or something like that. But sometimes you write-protect pages that you actually expect a program to write to, like memory-mapped pages, because then when the write happens it triggers something else (like copy-on-write or marking a page dirty or something).
Rules like "Don't use dynamically allocated memory" are one example of this -- not that they really expect you never to do it, but that marking it "Required" is a way to force you to document how you plan to make it safe.
Similarly, if it's easier to rearrange a function to have only a single exit point than to explain why you need multiple exit points, just rearrange it; if you really need multiple exit points, just document why.
This is true, in most of these things you can document your way around the problem, though sometimes you also have a third-party to convince who may or may not be reasonable. But either way, MISRA is a collection of stating the obvious ("don't do things the language says you shouldn't do" is like fully half of the list) and arbitrary restrictions that have little justification, so the fact that you can document your way around it still basically means you're doing extra work for no real benefit (because it's real difficult within such a conservative field to say "hey, that industry standard? It's crap, has always been crap, and we're going to ignore it").
Lots of MISRA code is proprietary and receives no upstream patches from customers. Still, it's not unusual to deliver a source blob to your customer instead of a binary blob, often for debugging. (But sometimes only the binary blob has the blessing of the vendor, so you can only use the binary blob in your released product.)
I would not be in the least surprised if someone has a compiler/transpiler from a higher level language to some C code which checks all MISRA boxes.
One reason to prefer it in C is to be able to easily add locally scoped functionality like profiling markers and temp allocators.
    profile_begin("func");
    a = temp_arena_begin();
    // ... code
    temp_arena_end();
    profile_end();
the abominations I've seen in code review from people trying to fulfill this rule still wake me up at night
Yeah, in this particular case MISRA C is doing more harm than good.
I wish we could get an update on these rules, but this issue has been brought up many, many times before and has always been brushed away without a proper analysis.
FORTRAN II introduced the RETURN statement.
Languages like Matlab, where the values returned are listed at the top of the function and you don’t even need a return statement to tell it what to return, always feel so funky and fun.
One reason I can think of off the top of my head (although I never had to deal with MISRA C at all) is that if you have to add some cleanup code before your function returns, then there is exactly one place and one place only to do that.
Otherwise this leads to duplication of cleanup code similar to
    allocate_something()
    ..
    if failed(foo) {
        deallocate_something()
        return FAILED;
    }
    ..
    deallocate_something()
    return SUCCESS;
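For comparison, here is a minimal sketch of the single-exit shape in C. The function `process`, the `SCRATCH_SIZE` constant, and the `status` enum are invented for this example. The result is carried in a variable, so the cleanup appears once and the function returns once:

```c
#include <stdlib.h>
#include <string.h>

#define SCRATCH_SIZE 16u

enum status { SUCCESS = 0, FAILED = 1 };

/* Copies `input` into a temporary buffer and reports its length.
 * Fails if allocation fails or the input does not fit. */
enum status process(const char *input, size_t *out_len)
{
    enum status result = FAILED;
    char *scratch = malloc(SCRATCH_SIZE);

    if (scratch != NULL) {
        size_t n = strlen(input);
        if (n < SCRATCH_SIZE) {
            memcpy(scratch, input, n + 1u);
            *out_len = n;
            result = SUCCESS;
        }
        free(scratch);  /* the only cleanup site */
    }

    return result;      /* the only exit point */
}
```

The trade-off is visible even at this size: each additional failure check adds another level of nesting, which is the kind of contortion other commenters in this thread object to.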
This is, more than anything, an argument for a `defer` statement, of the sort you can enjoy in Zig right now.
Or hopefully, eventually, in C, thanks to the tireless efforts of JeanHeyde Meneide:
https://thephd.dev/just-put-raii-in-c-bro-please-bro-just-on...
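Until something like `defer` is standardized, GCC and Clang offer a non-standard `cleanup` variable attribute that gets close. A rough sketch follows; it is a compiler extension, not ISO C, and not something a MISRA-compliant codebase would normally use. The names `free_buffer`, `cleanup_calls`, and `parse` are invented for the example:

```c
#include <stdlib.h>

/* Counts how often the cleanup handler ran (for demonstration only). */
static int cleanup_calls = 0;

/* Cleanup handler: receives a pointer to the variable going out of scope. */
static void free_buffer(char **p)
{
    free(*p);           /* free(NULL) is a no-op, so a failed malloc is fine */
    cleanup_calls++;
}

/* Early returns are safe here: the compiler calls free_buffer(&buf)
 * on every path that leaves the scope of `buf`. */
int parse(int fail_early)
{
    __attribute__((cleanup(free_buffer))) char *buf = malloc(32);

    if (buf == NULL) {
        return -1;
    }
    if (fail_early) {
        return 1;
    }
    return 0;
}
```

Both `parse(1)` and `parse(0)` release the buffer exactly once, without the cleanup call being written out on each return path.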
> certain name manglings are not guaranteed to be 1:1 and can infact “demangle” into multiple different plausible entities.
Now I'm really curious, doesn't that mean some valid C++ code would fail to link for having multiple definitions of the same symbol??
I would expect name mangling to be a bijection from function prototype to string.
MISRA C can’t mandate new language features though.
The MISRA people work with the C/C++ committees on upcoming language changes, which gives them loud voices to push things they want promoted.
It uses a proprietary license, if you're wondering.
Nice!
But not interesting for me in any way since it’s not open source.
Unfortunately, the entire reason I didn't release Unicorn under an OSI approved license is because I see many (most?) FOSS projects are chronically underfunded. Now, I did not quit my job and build this to get rich or anything, but I do need to earn enough to sustain myself. If there's enough interest, I would consider crowdfunding a release under an OSI license.
Why not dual license it under a commercial license and something like GPL?
I went back and forth on this and in my uncertainty I decided it was better to start more "closed" first with the potential to become more "open" over time.
Therefore it will never be open source, and if it ever is, then only once you have lost interest in the project. Got it.
Thank you, let me know if you have any questions.
Who is your main target audience?
Primarily, companies developing for embedded systems or other resource-constrained devices.