Show HN: I built an embeddable Unicode library with MISRA C conformance (railgunlabs.com)

117 points by hgs3 7 months ago | 53 comments

Hello, everyone. I built Unicorn: an embeddable MISRA C:2012 implementation of essential Unicode Algorithms.

Unicorn is designed to be fully customizable: you can select which Unicode algorithms and character properties are included or excluded from compilation. You can also exclude Unicode character blocks wholesale for scripts your application does not support. It's perfect for resource constrained devices like microcontrollers and IoT devices.
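
As a rough illustration of compile-time feature selection in C, here is a minimal sketch. The macro names `FEATURE_CASE_MAPPING` and `FEATURE_GREEK` and the function `to_upper` are hypothetical and are not Unicorn's actual configuration mechanism or API; the sketch only shows how unused code paths can be excluded from a build.

```c
#include <stdint.h>

/* Hypothetical feature switches, normally set by the build system
   (for example -DFEATURE_GREEK=1). These names are assumptions. */
#ifndef FEATURE_CASE_MAPPING
#define FEATURE_CASE_MAPPING 1
#endif
#ifndef FEATURE_GREEK
#define FEATURE_GREEK 0
#endif

#if FEATURE_CASE_MAPPING
/* Toy uppercase mapping: ASCII always, basic Greek only when enabled. */
uint32_t to_upper(uint32_t cp)
{
    uint32_t result = cp;
    if ((cp >= 0x61u) && (cp <= 0x7Au)) {            /* a..z */
        result = cp - 0x20u;
    }
#if FEATURE_GREEK
    /* U+03B1..U+03C9, skipping final sigma U+03C2, which this sketch
       does not handle. */
    else if ((cp >= 0x3B1u) && (cp <= 0x3C9u) && (cp != 0x3C2u)) {
        result = cp - 0x20u;
    }
#endif
    return result;
}
#endif
```

With the defaults above, the Greek branch is never compiled, so `to_upper(0x3B1)` returns its argument unchanged.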

About me: I quit my Big Corp job a few years back to pursue my passion for software development and this is one of my first commercial releases.

Someone 7 months ago
On https://railgunlabs.com/unicorn/manual/misra-compliance/, I think you will want to fix a typo in
```
  1.2    Required    Compliant (verified by compiling with Clang's -pdentic flag)
                                                                   ^^^^^^^^
```
Or am I too pedantic?
- hgs3 7 months ago |parent
  This is the most ironic typo I've ever made. Thanks for the catch. I've corrected it.
  - canucker2016 7 months ago |parent
    A couple more suggestions.
    List the platforms (& compilers) that you've tested on.
    Compare (pros/cons) against other Unicode libs (like others have done elsewhere in this thread, i.e. https://news.ycombinator.com/item?id=42424637 and https://news.ycombinator.com/item?id=42424638)
    - hgs3 7 months ago |parent
      Thanks for the suggestions. I think a comparison table would be useful, but I want to make sure I do it right since I'd be comparing my work to someone else's.
      As for the compilers, I’ve tested the library with GCC, Clang, and MSVC, and with the -pedantic flag like the GP mentioned. The library should build with any standard-compliant C99 compiler.
chris_wot 7 months ago
I don’t get the whole MISRA requirement that functions should only have one exit point. Honestly, nobody has been able to explain why this is important, other than it’s a historical anomaly inherited from FORTRAN. (Which was actually for a good reason)
- AlotOfReading 7 months ago |parent
  That's one of many rules in MISRA that originate from antiquated "best practices" from the dark ages that don't actually improve safety. We have it today by way of IEC 61508, which gets it from a book on structured programming called Structured Design. That book didn't recommend banning multiple exit points, but it recommended minimizing them to simplify the control flow graph and said code should minimize the distance between black boxes (bits of code that do something without leaky abstractions and have only one return statement). The IEC authors and MISRA thought the logical extension of that was to make everything have one exit point.
  - elcritch 7 months ago |parent
    I recall reading a study showing that MISRA actually tended to _increase_ the average number of bugs in software projects.
- layer8 7 months ago |parent
  This is an old rule from when structured programming was introduced. The prior state of affairs was that code would jump via gotos between functions to different labels within those functions (labels were global). The requirement that every function should have only a single entry point and a single exit point seemed like a good rule to establish sanity.
  MISRA C states the following rationale:
  “A single point of exit is required by IEC 61508 and ISO 26262 as part of the requirements for a modular approach.
  Early returns may lead to the unintentional omission of function termination code.
  If a function has exit points interspersed with statements that produce persistent side effects, it is not easy to determine which side effects will occur when the function is executed.”
  Note that the MISRA C rule is merely advisory, meaning it is a recommendation and not a hard requirement (i.e. it’s a “should” and not a “shall”).
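  A minimal sketch of the "termination code" concern from the rationale above. The function `first_byte` is a hypothetical example, not taken from MISRA or from Unicorn; it keeps a status variable and a single `return` so the `fclose` call cannot be skipped on any path.

```c
  #include <stdio.h>

  /* Hypothetical helper: reads the first byte of a file.
     Returns 0 on success, -1 on failure. */
  int first_byte(const char *path, unsigned char *out)
  {
      int status = -1;              /* assume failure until proven otherwise */
      FILE *fp = fopen(path, "rb");

      if (fp != NULL) {
          int c = fgetc(fp);
          if (c != EOF) {
              *out = (unsigned char)c;
              status = 0;
          }
          /* Termination code lives in one place, so no path skips it. */
          (void)fclose(fp);
      }

      return status;                /* the only exit point */
  }
```

  With an early `return -1;` inside the `c == EOF` case, the file would stay open unless the author remembered to repeat the `fclose` call there.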
- gwd 7 months ago |parent
  Having been only lightly exposed to MISRA, my impression is that in MISRA the "Required" label doesn't mean "You must always do this"; rather it means, "You must do this or document why it's necessary and safe not to do it".
  It's a bit like an operating system write-protecting pages. Sometimes you write-protect pages that the process really shouldn't write to, like shared libraries or something like that. But sometimes you write-protect pages that you actually expect a program to write to, like memory-mapped pages, because then when the write happens it triggers something else (like copy-on-write or marking a page dirty or something).
  Rules like "Don't use dynamically allocated memory" are one example of this -- not that they really expect you never to do it, but that marking it "Required" is a way to force you to document how you plan to make it safe.
  Similarly, if it's easier to rearrange a function to have only a single exit point than to explain why you need multiple exit points, just rearrange it; if you really need multiple exit points, just document why.
  - rcxdude 7 months ago |parent
    This is true, in most of these things you can document your way around the problem, though sometimes you also have a third-party to convince who may or may not be reasonable. But either way, MISRA is a collection of stating the obvious ("don't do things the language says you shouldn't do" is like fully half of the list) and arbitrary restrictions that have little justification, so the fact that you can document your way around it still basically means you're doing extra work for no real benefit (because it's real difficult within such a conservative field to say "hey, that industry standard? It's crap, has always been crap, and we're going to ignore it").
- champijone 7 months ago |parent
  One reason to prefer it in C is to be able to easily add locally scoped functionality like profiling markers and temp allocators.
```
  profile_begin("func");
  a = temp_arena_begin();
  // ... code
  temp_arena_end();
  profile_end();
```
- aulin 7 months ago |parent
  the abominations I've seen in code review from people trying to fulfill this rule still wake me up at night
  - daghamm 7 months ago |parent
    Yeah, in this particular case MISRA C is doing more harm than good.
    I wish we could get an update on these rules, but this issue has been brought up many, many times before and has always been brushed away without a proper analysis.
- actionfromafar 7 months ago |parent
  Lots of MISRA code is proprietary and receives no upstream patches from customers. Still, it's not unusual to deliver a source blob to your customer instead of a binary blob, often for debugging. (But sometimes only the binary blob has the blessing of the vendor, so you can only use the binary blob in your released product.)
  I would not be in the least surprised if someone has a compiler/transpiler from a higher level language to some C code which checks all MISRA boxes.
- dark-star 7 months ago |parent
  one reason I can think of off the top of my head (although I never had to deal with MISRA C at all) is that if you have to add some cleanup code before your function returns, then there is exactly one place and one place only to do that.
  Otherwise this leads to duplication of cleanup code similar to
```
  allocate_something()
  ..
  if failed(foo) {
    deallocate_something()
    return FAILED;
  }
  ..
  deallocate_something()
  return SUCCESS;
```
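  One common way to avoid that duplication in plain C is a single cleanup label, sketched below. The function `concat_copy` is a hypothetical example. Note that this idiom relies on `goto`, which MISRA C also discourages, so it illustrates the single-cleanup-point idea rather than a MISRA-conformant pattern.

```c
  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical example: builds a + b in a new heap buffer via a
     temporary copy. Returns 0 on success, -1 on failure. */
  int concat_copy(const char *a, const char *b, char **out)
  {
      int status = -1;
      char *tmp = NULL;
      char *result = NULL;
      size_t la = strlen(a);
      size_t lb = strlen(b);

      tmp = malloc(la + 1);
      if (tmp == NULL) goto cleanup;
      memcpy(tmp, a, la + 1);

      result = malloc(la + lb + 1);
      if (result == NULL) goto cleanup;
      memcpy(result, tmp, la);
      memcpy(result + la, b, lb + 1);   /* copies b's terminator too */

      *out = result;
      result = NULL;   /* ownership moved to the caller; skip the free below */
      status = 0;

  cleanup:
      free(result);    /* free(NULL) is a no-op */
      free(tmp);
      return status;
  }
```

  Every failure path jumps to the same label, so each resource is released in exactly one place.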
  - samatman 7 months ago |parent
    This is, more than anything, an argument for a `defer` statement, of the sort you can enjoy in Zig right now.
    Or hopefully, eventually, in C, thanks to the tireless efforts of JeanHeyde Meneide:
    https://thephd.dev/just-put-raii-in-c-bro-please-bro-just-on...
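    Until something like `defer` is standardized, GCC and Clang offer a rough approximation through the `cleanup` variable attribute. The sketch below assumes one of those compilers; `free_charp` and `copy_length` are hypothetical names. Being a language extension, it is unlikely to be acceptable under MISRA C without a documented deviation.

```c
    #include <stdlib.h>
    #include <string.h>

    /* Cleanup handler: receives a pointer to the annotated variable. */
    static void free_charp(char **p)
    {
        free(*p);
    }

    /* Returns the length of a temporary copy of s, or -1 on failure.
       The buffer is released automatically on every return path. */
    long copy_length(const char *s)
    {
        __attribute__((cleanup(free_charp))) char *buf = malloc(strlen(s) + 1);

        if (buf == NULL) {
            return -1;              /* early return: free_charp(&buf) still runs */
        }
        strcpy(buf, s);
        return (long)strlen(buf);   /* value is computed before buf is freed */
    }
```

    The handler runs when `buf` goes out of scope, so early returns no longer need their own copy of the cleanup code.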
    - eddd-ddde 7 months ago |parent
      > certain name manglings are not guaranteed to be 1:1 and can infact “demangle” into multiple different plausible entities.
      Now I'm really curious, doesn't that mean some valid C++ code would fail to link for having multiple definitions of the same symbol??
      I would expect name mangling to be a bijection from function prototype to string.
    - layer8 7 months ago |parent
      MISRA C can’t mandate new language features though.
      - AlotOfReading 7 months ago |parent
        The MISRA people work with the C/C++ committees on upcoming language changes, which gives them loud voices to push things they want promoted.
- pklausler 7 months ago |parent
  FORTRAN II introduced the RETURN statement.
  - bee_rider 7 months ago |parent
    Languages like Matlab, where the values returned are listed at the top of the function and you don’t even need a return statement to tell it what to return, always feel so funky and fun.
rubicks 7 months ago
This is not a comment about open/closed-source software and/or licensing models.
Projects like this never fail to impress me vis-a-vis source obfuscation. The 'generate.pyz' is an interesting twist on the usual practice.
- layer8 7 months ago |parent
```
    #  You may not reverse engineer, decompile, disassemble, or otherwise attempt
    #  to derive the source code or underlying structure of this script
```
  This prohibition is void in certain relevant jurisdictions, for any publicly available product.
__turbobrew__ 7 months ago
There is not much to show if I can’t read the source code.
- hgs3 7 months ago |parent
  You can download a prebuilt amalgamation from GitHub to see the amalgamated C code [1]. The GitHub repo contains the code that generates the amalgamation.
  [1] https://github.com/railgunlabs/unicorn/releases/
kiritanpo 7 months ago
This looks interesting. Most embedded projects I know use ICU/libicu for their Unicode needs. As a potential customer I would like to know how it compares against ICU for performance and code size. Why should I switch?
- hgs3 7 months ago |parent
  > I would like to know how does it compare against ICU
  ICU is a large library, typically around 40 MB depending on the platform, whereas Unicorn, with all features enabled, is only about 600 KB.
  ICU has a broader scope: it's not just a Unicode library, but also an internationalization library. Unicorn, on the other hand, is specifically focused on Unicode algorithms.
  ICU wasn't designed to be customized. It's also non-MISRA compliant and written in C++11. In contrast, Unicorn is written in C99, fully customizable, MISRA compliant, and only requires a few features from libc [1]. It's far more portable.
  [1] https://github.com/railgunlabs/unicorn/?tab=readme-ov-file#u...
  - ranger_danger 7 months ago |parent
    The most important difference to me, which is a deal-breaker, is that Unicorn is non-commercial.
    Note that I am not interested in actually using Unicorn commercially, but my understanding is that this restriction makes the library incompatible with FOSS licenses such as GPL.
garganzol 7 months ago
My comment is not directly related to the particular project which is impressive, but more to its presentation. If you go to the author's website, you will find neat to-the-point manuals and other useful information. This is what I call the real Web 3.0. Simple and to the point. Also the main company page is humorous in a good way, about the mad scientist etc.
- hgs3 7 months ago |parent
  I appreciate the kind words. When I was designing the website, I wanted to inject my personality into it. I wasn't sure how well the "playfulness" would go over, but I'm glad you enjoyed it.
biosboiii 7 months ago
Since MISRA is targeted at Automotive, as a software dev in the automotive space I would suggest adding the note that this is able to run on POSIX compliant OSes like QNX :)
If you would like to chat, hit me up.
tocariimaa 7 months ago
It uses a proprietary license if you're wondering.
rurban 7 months ago
This is commercial only. Free and small is my safeclib, which does about half of it. ICU is not usable on small devices, and also pretty slow. It's much faster to use precomputed tables per algorithm, such as here or in safeclib. libunistring is also extremely slow. This was tried for grep and failed.
- hgs3 7 months ago |parent
  > This is commercial only.
  You can use Unicorn for non-commercial use [1], but yes, for commercial use you need to buy a license.
  > It's much faster to use precomputed tables per algorithm
  You're absolutely right about using precomputed tables per algorithm. That is the secret to the library's speed.
  > Free and small is my safeclib, which does about half of it.
  I like safeclib! It's nice to hear from the author. It's worth distinguishing that safeclib is a safer string library whereas Unicorn is a Unicode algorithms library, not a string library.
  [1] https://github.com/railgunlabs/unicorn/blob/master/LICENSE
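  To illustrate what a precomputed property table can look like, here is a toy two-stage lookup. It is not Unicorn's actual table layout; the names `stage1`, `stage2`, and `is_ascii_digit`, the 16-entry block size, and the ASCII-only range are all assumptions chosen to keep the example short.

```c
  #include <stdbool.h>
  #include <stdint.h>

  #define BLOCK_SHIFT 4u                          /* 16 code points per block */
  #define BLOCK_MASK  ((1u << BLOCK_SHIFT) - 1u)
  #define MAX_CP      0x7Fu                       /* toy range: ASCII only */

  /* Stage 1: maps each 16-code-point block to a row of stage 2.
     Identical blocks share a row, which is where the space saving comes from. */
  static const uint8_t stage1[8] = { 0, 0, 0, 1, 0, 0, 0, 0 };

  /* Stage 2: one property flag per code point. Row 1 covers U+0030..U+003F. */
  static const uint8_t stage2[2][16] = {
      { 0 },                                              /* no digits in this block */
      { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 }  /* '0'..'9', then ':'..'?' */
  };

  /* Constant-time property query: two array reads, no searching. */
  bool is_ascii_digit(uint32_t cp)
  {
      bool result = false;
      if (cp <= MAX_CP) {
          result = stage2[stage1[cp >> BLOCK_SHIFT]][cp & BLOCK_MASK] != 0u;
      }
      return result;
  }
```

  Seven of the eight blocks share the all-zero row; the same deduplication is what keeps tables manageable when the range is the full Unicode code space.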
  - rurban 7 months ago |parent
    Well, a string is unicode nowadays. And for sure not just a zero-terminated blob. That would be a buffer. Only the Linux kernel still holds this invalid view.
    So every string library needs at least a compare function to find strings, with all the variants of same graphemes. Which leads us to NFC normalization for a start. Upcase tables and wordlength tables are also needed.
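    A small sketch of why a byte-wise compare is not enough: the two UTF-8 strings below both render as "é" and are canonically equivalent, yet `strcmp` sees different bytes. The names `precomposed`, `decomposed`, and `bytes_equal` are hypothetical.

```c
    #include <string.h>

    /* "é" as one precomposed code point: U+00E9 (UTF-8: C3 A9). */
    static const char precomposed[] = "\xC3\xA9";

    /* "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT (UTF-8: 65 CC 81). */
    static const char decomposed[] = "e\xCC\x81";

    /* Returns nonzero when the two byte sequences are identical. */
    int bytes_equal(const char *a, const char *b)
    {
        return strcmp(a, b) == 0;
    }
```

    `bytes_equal(precomposed, decomposed)` is zero here; the strings would only match after both are normalized to the same form, such as NFC.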
- fao_ 7 months ago |parent
  > This is commercial only. Free and small is my safeclib
  Is it me or does this feel a bit weird? It seems like you're using the comments section here to self-advertise for exposure.
  I read it like — "businesses can't use this without paying the OP, however, if you're a business you can get 50% of the way there by using _my_ library, and you don't even have to pay me!". It comes off incredibly rude to try to undercut the OP like this.
  - ykonstant 7 months ago |parent
    Self promotion in comments is perfectly fine on HN, as long as the comment is on topic and informative---both of which are true here.
  - josephcsible 7 months ago |parent
    In general I'd agree with that, but IMO the benefit of having a FOSS alternative to a proprietary product overrides it in this case.
    - fao_ 7 months ago |parent
      See this is where I and the FSF/OSI diverge, because
      > the right to use, copy, modify, merge, publish and distribute the Software [as long as you're not selling it or derivatives of the Software]
      seems to line up with exactly what the folks involved in Free Software originally wanted — the ability to fix, patch, debug software that runs on their systems. I also think it's incredibly important to have non-commercial clauses given that the vast majority of technical infrastructure in the modern world is built on FOSS, all while the companies give nothing back and developers of FOSS starve.
      If Valve can dump hundreds of developers into FOSS and within, what, 7 years? bring Linux almost to parity with Windows for gaming performance, imagine what would happen if FOSS developers were actually given funding!
sushidev 7 months ago
Nice!
- sushidev 7 months ago |parent
  But not interesting for me in any way since it’s not open source.
  - hgs3 7 months ago |parent
    Unfortunately, the entire reason I didn't release Unicorn under an OSI approved license is because I see many (most?) FOSS projects are chronically underfunded. Now, I did not quit my job and build this to get rich or anything, but I do need to earn enough to sustain myself. If there's enough interest, I would consider crowdfunding a release under an OSI license.
    - kouteiheika 7 months ago |parent
      Why not dual license it under a commercial license and something like GPL?
      - hgs3 7 months ago |parent
        I went back and forth on this and in my uncertainty I decided it was better to start more "closed" first with the potential to become more "open" over time.
        - h4ck_th3_pl4n3t 7 months ago |parent
          Therefore it will never be open source, and if so then only when you lost interest in the project. Got it.
- hgs3 7 months ago |parent
  Thank you, let me know if you have any questions.
  - sushidev 7 months ago |parent
    Who is your main target audience?
    - hgs3 7 months ago |parent
      Primarily, companies developing for embedded systems or other resource constrained devices.