My iPhone 16 Pro Max produces garbage output when running MLX LLMs (journal.rafaelcosta.me)

304 points by rafaelcosta 14 hours ago | 96 comments

csmantle 11 hours ago
Methodology is one thing; I can't really agree that deploying an LLM to do sums is great. Almost as hilarious as asking "What's moon plus sun?"
But the phenomenon is another thing. Apple's numerical APIs are producing inconsistent results on a minority of devices. This is something worth Apple's attention.
- JimboOmega 9 hours ago |parent
  (This is a total digression, so apologies)
  My mind instantly answered that with "bright", which is what you get when you combine the sun and moon radicals to make 明(https://en.wiktionary.org/wiki/%E6%98%8E)
  Anyway, that question is not without reasonable answers. "Full Moon" might make sense too. No obvious deterministic answer, though, naturally.
  - awesome_dude 7 hours ago |parent
    FTR the Full Moon was exactly 5 hours ago (It's not without humour that this conversation occurs on the day of the full moon :)
- nonoesp 5 minutes ago |parent
  > What's moon plus sun?
  "Monsoon," says ChatGPT.
- idk1 2 hours ago |parent
  As an aside, one of my very nice family members likes tarot card reading, and I think you'd get an extremely different answer there for "What's moon plus sun?" Since they're opposites, I would guess something like "Mixed signals or insecurity get resolved by openness and real communication." It's kind of fascinating, the range of answers to that question. As a couple of other people have mentioned, it could mean loads of things. I thought I'd add one in there.
  I'll just add that if you think this advice applies to you, it's the - https://en.wikipedia.org/wiki/Barnum_effect
- CrispinS 9 hours ago |parent
  > What's moon plus sun?
  Eclipse, obviously.
  - christophilus 9 hours ago |parent
    That’s sun minus moon. Moon plus sun is a wildly more massive, nuclear furnace of a moon that also engulfs the earth.
    - SauntSolaire 5 hours ago |parent
      Reminds me of this AI word combination game recently shared on HN, with almost exactly these mechanics:
      https://neal.fun/infinite-craft/
      For the record, Sun+Moon is indeed eclipse.
    - fsckboy 5 hours ago |parent
      >Moon plus sun is a wildly more massive, nuclear furnace of a moon that also engulfs the earth.
      i just looked up mass of sun vs mass of moon (they differ by 10^30 vs 10^20), and the elemental composition of the sun: the moon would entirely disappear into the insignificant digits of trace elements which are in the range of .01 % of the sun. I could be off by orders of magnitude all over the place and it would still disappear.
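A rough check of the arithmetic above, as a sketch only. The masses are rounded textbook values (Sun about 1.989e30 kg, Moon about 7.35e22 kg, so the gap is nearer eight orders of magnitude than ten), and `to_f32` is a helper made up for this example. It also shows that in 32-bit floats the Moon really does vanish into the Sun's insignificant digits.

```python
import struct

SUN_KG = 1.989e30   # approximate solar mass (rounded textbook value)
MOON_KG = 7.35e22   # approximate lunar mass (rounded textbook value)


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Fraction of the combined body that would be "moon": a few parts in 1e8.
ratio = MOON_KG / SUN_KG
print(f"moon / sun = {ratio:.2e}")

# In binary32 the Moon is smaller than half the spacing between adjacent
# representable values near the Sun's mass, so the sum rounds back to the Sun.
sun32 = to_f32(SUN_KG)
moon32 = to_f32(MOON_KG)
absorbed = to_f32(sun32 + moon32) == sun32
print("float32 sun + moon == sun:", absorbed)
```

In binary64 the sum does change, since the ratio is far above double-precision epsilon; the absorption is specific to the narrower format.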
    - mcny 7 hours ago |parent
      Wait so moon plus sun != sun plus moon? :Thinking:
      - chii 5 hours ago |parent
        celestial objects don't need to obey algebraic commutativity!
        direwolf20 17 minutes ago |parent
        I wonder if SCP-1313 does
    - dcrazy 8 hours ago |parent
      This thread reminds me of Scribblenauts, the game where you conjure objects to solve puzzles by describing them. I suspect it was an inspiration for Baba Is You.
      - Der_Einzige 8 hours ago |parent
        Scribblenauts was also an early precursor to modern GenAI/word embeddings. I constantly bring it up in discussions of the history of AI for this reason.
        veqq 5 hours ago |parent
        Could you explain? :3
    - lkjdsklf 5 hours ago |parent
      Here i was, like an idiot, thinking it was moon light
    - AuryGlenz 7 hours ago |parent
      Or potentially a sun that lasts slightly longer?
  - IsTom 31 minutes ago |parent
    INSUFFICIENT DATA FOR MEANINGFUL ANSWER.
  - jraph 2 hours ago |parent
    Moon plus sun would be sun because the sun would be an absorbing element.
    - acka 2 hours ago |parent
      Moon implies there is a planet the moon is orbiting. So unless the planet and its moon are too close to the sun the long term result could also be: solar system.
      - jraph an hour ago |parent
        This goes to show how that plus operation is awfully defined.
  - geuis 8 hours ago |parent
    Not obvious. Astronomers are actively looking for signatures of exomoons around exoplanets. So "sun plus moon" could mean that too.
    - xattt 8 hours ago |parent
      The OP said moon + sun, rather than sun + moon. We have no idea yet if celestial math is non-communicative.
      - BenjiWiebe 6 hours ago |parent
        *commutative
        tbrownaw 6 hours ago |parent
        Well, that too.
    - godelski 5 hours ago |parent
      Well you find the signature by looking for a dip in the sun's luminosity. So minus might be the better relationship here
tgma 2 hours ago
The author is assuming Metal is compiled to ANE in MLX. MLX is by-and-large GPU-based and not utilizing ANE, barring some community hacks.
DustinEchoes 10 hours ago
I wish he would have tried on a different iPhone 16 Pro Max to see if the defect was specific to that individual device.
- crossroadsguy 9 hours ago |parent
  So true! And as any sane Apple user or the standard template Apple Support person would have suggested (and as they actually suggest) - did they try reinstalling the OS from scratch after having reset the data (of course before backing it up; preferably with a hefty iCloud+ plan)? Because that's the thing to do in such issues and it's very easy.
  - post-it 6 hours ago |parent
    Reinstalling the OS sucks. I need to pull all my bank cards out of my safe and re-add their CVV's to the wallet, and sometimes authenticate over the phone. And re-register my face. And log back in to all my apps. It can take an hour or so, except it's spread out over weeks as I open an app and realize I need to log in a dozen times.
- jajuuka 6 hours ago |parent
  Latest update at the bottom of the page.
  "Well, now it's Feb. 1st and I have an iPhone 17 Pro Max to test with and... everything works as expected. So it's pretty safe to say that THAT specific instance of iPhone 16 Pro Max was hardware-defective."
  - Someone 3 hours ago |parent
    That logic is somewhat [1] correct, but it doesn’t say anything about whether all iPhone 16 Pro Maxes, some of them, or only this particular one are hardware-defective.
    [1] as the author knows (“MLX uses Metal to compile tensor operations for this accelerator. Somewhere in that stack, the computations are going very wrong”) there’s lots of soft- and firmware in-between the code being run and the hardware of the neural engine. The issue might well be somewhere in those.
Buttons840 12 hours ago
I clicked hoping this would be about how old graphing calculators are generally better math companions than a phone.
The best way to do math on my phone I know of is the HP Prime emulator.
- watersb 5 hours ago |parent
  PCalc -- because it runs on every Apple platform since the Mac Classic:
  https://pcalc.com/mac/thirty.html
  My other favorite calculator is free42, or its larger display version plus42
  https://thomasokken.com/plus42/
  For a CAS tool on a pocket mobile device, I haven't found anything better than MathStudio (formerly SpaceTime):
  https://mathstud.io
  You can run that in your web browser, but they maintain a mobile app version. It's like a self-hosted Wolfram Alpha.
  - Melatonic 3 hours ago |parent
    The last one was interesting but both apps haven't been updated in 4 years. Hard to pay for something like that.
    They do have some new AI math app that's regularly updated
- xoa 9 hours ago |parent
  My personal favorite is iHP48 (previously I used m48+ before it died) running an HP 48GX with MetaKernel installed, as I used through college. Still just so intuitive and fast to me.
  - wolvoleo 8 hours ago |parent
    I still have mine. Never use it though as I'm not handy with RPN anymore. :'(
- xp84 9 hours ago |parent
  I was pretty delighted to realize I could now delete the lame Calculator.app from my iPhone and replace it with something of my choice. For now I've settled on NumWorks, which is apparently an emulator of a modern upstart physical graphing calc that has made some inroads into schools. And of course, you can make a Control Center button to launch an app, so that's what I did.
  Honestly, the main beef I have with Calculator.app is that on a screen this big, I ought to be able to see several previous calculations and scroll up if needed. I don't want an exact replica of a 1990s 4-function calculator like the default is (ok, it has more digits and the ability to paste, but besides that, adds almost nothing).
  - Buttons840 9 hours ago |parent
    I looked at that calculator. But HP Prime and TI-89 have CAS systems that can do symbolic math, so I prefer to emulate them.
- VorpalWay 12 hours ago |parent
  I run a TI 83+ emulator on my Android phone when I don't have my physical calculator at hand. Same concept, just learned a different brand of calculators.
  - varun_ch 10 hours ago |parent
    built-in calculator apps are surprisingly underbaked... I'm surprised neither of the big two operating systems has elected to ship something comparable to a real calculator built in. It would be nice if we could preview the whole expression as we type it.
    I use the NumWorks emulator app whenever I need something more advanced. It's pretty good https://www.numworks.com/simulator/
    - josephg 3 hours ago |parent
      That’s certainly an improvement - but why can’t I modify a previous expression? Or tap to select previous expressions?
      What I want is something like a repl. I want to be able to return to an earlier expression, modify it, assign it to a variable, use that variable in another expression, modify the variable and rerun and so on.
- realityfactchex 8 hours ago |parent
  GraphNCalc83 is awesome [0].
  [0] https://apps.apple.com/us/app/graphncalc83/id744882019
- nickorlow 6 hours ago |parent
  Anytime I have to do some serious amount of math, I have to go dig around and find my TI-84, everything is just burned into muscle memory
- shiroiuma 7 hours ago |parent
  I use the "RealCalc" app on my phone. It's pretty similar to my old HP48.
raincole 12 hours ago
Low level numerical operation optimizations are often not reproducible. For example: https://www.intel.com/content/dam/develop/external/us/en/doc... (2013)
But it's still surprising that that LLM doesn't work on iPhone 16 at all. After all LLMs are known for their tolerance to quantization.
- bri3d 12 hours ago |parent
  Yes, "floating point accumulation doesn't commute" is a mantra everyone should have in their head, and when I first read this article, I was jumping at the bit to dismiss it out of hand for that reason.
  But, what got me about this is that:
  * every other Apple device delivered the same results
  * Apple's own LLM silently failed on this device
  to me that behavior suggests an unexpected failure rather than a fundamental issue; it seems Bad (TM) that Apple would ship devices where their own LLM didn't work.
  - DavidVoid 13 minutes ago |parent
    I would go even further and state that "you should never assume that floating point functions will evaluate the same on two different computers, or even on two different versions of the same application", as the results of floating point evaluations can differ depending on platform, compiler optimizations, compilation-flags, run-time FPU environment (rounding mode, &c.), and even memory alignment of run-time data.
    There's a C++26 paper about compile time math optimizations with a good overview and discussion about some of these issues [P1383]. The paper explicitly states:
    1. It is acceptable for evaluation of mathematical functions to differ between translation time and runtime.
    2. It is acceptable for constant evaluation of mathematical functions to differ between platforms.
    So C++ has very much accepted the fact that floating point functions should not be presumed to give identical results in all circumstances.
    Now, it is of course possible to ensure that floating point-related functions give identical results on all your target machines, but it's usually not worth the hassle.
    [P1383]: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p13...
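One concrete way results can differ between platforms is reduction order. A minimal sketch, assuming nothing about any particular compiler or accelerator: the same four values summed in two orders, with `math.fsum` as an exactly rounded reference. The helper name `sequential_sum` and the sample values are made up for illustration.

```python
import math


def sequential_sum(values):
    """Plain left-to-right accumulation, one rounding per addition.

    A manual loop is used because newer CPython versions compensate float
    sums inside the built-in sum(), which would hide the effect.
    """
    total = 0.0
    for v in values:
        total += v
    return total


big_first = [1e16, 1.0, 1.0, -1e16]
small_first = [1.0, 1.0, 1e16, -1e16]

# Near 1e16 adjacent doubles are 2 apart, so adding 1.0 alone is rounded away.
print(sequential_sum(big_first))    # 0.0
# Adding the small terms first lets them combine into a representable 2.0.
print(sequential_sum(small_first))  # 2.0
# fsum tracks the lost low-order bits and returns the exactly rounded sum.
print(math.fsum(big_first))         # 2.0
```

A parallel reduction on a GPU is free to pick either grouping, which is why bit-identical results across devices are not something to rely on.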
  - sva_ 10 hours ago |parent
    > floating point accumulation doesn't commute
    It is commutative (except for NaN). It isn't associative though.
    - ekelsen 9 hours ago |parent
      I think it commutes even when one or both inputs are NaN? The output is always NaN.
      - DavidVoid 2 hours ago |parent
        Unless you compile with fast-math ofc, because then the compiler will assume that NaN never occurs in the program.
      - addaon 8 hours ago |parent
        NaNs are distinguishable. /Which/ NaN you get doesn't commute.
        ekelsen 8 hours ago |parent
        I guess at the bit level, but not at the level of computation? Anything that relies on bit patterns of nans behaving in a certain way (like how they propagate) is in dangerous territory.
        addaon 7 hours ago |parent
        > Anything that relies on bit patterns of nans behaving in a certain way (like how they propagate) is in dangerous territory.
        Why? This is well specified by IEEE 754. Many runtimes (e.g. for Javascript) use NaN boxing. Treating floats as a semi-arbitrary selection of rational numbers plus a handful of special values is /more/ correct than treating them as real numbers, but treating them as actually specified does give more flexibility and power.
        ekelsen 6 hours ago |parent
        Can you show me where in the ieee spec this is guaranteed?
        My understanding is the exact opposite - that it allows implementations to return any NaN value at all. It need not be any that were inputs.
        It may be that JavaScript relies on it and that has become more binding than the actual spec, but I don't think the spec actually guarantees this.
        Edit: actually it turns out nan-boxing does not involve arithmetic, which is why it works. I think my original point stands, if you are doing something that relies on how bit values of NaNs are propagated during arithmetic, you are on shaky ground.
        addaon 5 hours ago |parent
        Don't have the spec handy, but specifically binary operations combining two NaN inputs must result in one of the input NaNs. For all of Intel SSE, AMD SSE, PowerPC, and ARM, the left hand operand is returned if both are signaling or both are quiet. x87 does weird things (but when doesn't it?), and ARM does weird things when mixing signaling and quiet NaNs.
        ekelsen 4 hours ago |parent
        I also don't have access to the spec, but the people writing Rust do and they claim this: "IEEE makes almost no guarantees about the sign and payload bits of the NaN"
        https://rust-lang.github.io/rfcs/3514-float-semantics.html
        See also this section of wikipedia https://en.wikipedia.org/wiki/NaN#Canonical_NaN
        "On RISC-V, most floating-point operations only ever generate the canonical NaN, even if a NaN is given as the operand (the payload is not propagated)."
        And from the same article:
        "IEEE 754-2008 recommends, but does not require, propagation of the NaN payload." (Emphasis mine)
        I call bullshit on the statement "specifically binary operations combining two NaN inputs must result in one of the input NaNs." It is definitely not in the spec.
        j16sdiz an hour ago |parent
        Blame the long and confusing language in spec:
        > For an operation with quiet NaN inputs, other than maximum and minimum operations, if a floating-point result is to be delivered the result shall be a quiet NaN which should be one of the input NaNs.
        The same document says:
        > shall -- indicates mandatory requirements strictly to be followed in order to conform to the standard and from which no deviation is permitted (“shall” means “is required to”)
        > should -- indicates that among several possibilities, one is recommended as particularly suitable, without mentioning or excluding others; or that a certain course of action is preferred but not necessarily required; or that (in the negative form) a certain course of action is deprecated but not prohibited (“should” means “is recommended to”)
        i.e. it is required to be a quiet NaN, and recommended to be one of the input NaNs.
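For anyone who wants to poke at the payload question on their own machine, here is a small sketch. It builds two quiet NaNs with different payloads and prints the bit pattern that addition returns in each operand order. The helper names and the payload values 1 and 2 are arbitrary choices for this example, and the printed patterns are deliberately not asserted, since that is exactly the part that may vary by platform.

```python
import math
import struct


def from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE 754 binary64 value."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def to_bits(x: float) -> int:
    """Reinterpret a binary64 value as its 64-bit integer pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


# Two quiet NaNs: exponent all ones, quiet bit set, different payloads.
nan_a = from_bits(0x7FF8_0000_0000_0001)
nan_b = from_bits(0x7FF8_0000_0000_0002)

print(math.isnan(nan_a), math.isnan(nan_b))  # True True
print(nan_a == nan_b)                        # False (a NaN never compares equal)

# The result is a NaN either way; which payload survives is the open question.
print(hex(to_bits(nan_a + nan_b)))
print(hex(to_bits(nan_b + nan_a)))
```

If the two printed patterns differ, that hardware propagates an operand's payload and the choice depends on operand order; if they match, it may be canonicalizing.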
  - danpalmer 11 hours ago |parent
    FYI, the saying is "champing at the bit", it comes from horses being restrained.
    - mylifeandtimes 9 hours ago |parent
      hey, I appreciate your love of language and sharing with us.
      I'm wondering if we couldn't re-think "bit" to the computer science usage instead of the thing that goes in the horse's mouth, and what it would mean for an AI agent to "champ at the bit"?
      What new sayings will we want?
      - nilamo 9 hours ago |parent
        Byting at the bit?
    - odo1242 9 hours ago |parent
      chomping at the bit
      - danpalmer 9 hours ago |parent
        Actually it was originally "champing" – to grind or gnash teeth. The "chomping" (to bite) alternative cropped up more recently as people misheard and misunderstood, but it's generally accepted as an alternative now.
        kortilla 8 hours ago |parent
        It’s actually accepted as the primary now and telling people about “champing” is just seen as archaic.
        danpalmer 8 hours ago |parent
        Do you have a source on this, or a definition for what it means to be "primary" here? All I can find is sources confirming that "champing" is the original and more technically correct, but that "chomping" is an accepted variant.
  - BeetleB 9 hours ago |parent
    As a sister comment said, floating point computations are commutative, but not associative.
    a * b = b * a for all "normal" floating point numbers.
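The commutative-but-not-associative distinction is easy to see with ordinary doubles. A minimal sketch; the values 0.1, 0.2 and 0.3 are just the usual illustrative choice:

```python
a, b, c = 0.1, 0.2, 0.3

# Commutativity: swapping operands gives identical results.
commutes_add = (a + b) == (b + a)
commutes_mul = (a * b) == (b * a)

# Associativity: regrouping changes which intermediate value gets rounded.
left = (a + b) + c
right = a + (b + c)

print(commutes_add, commutes_mul)  # True True
print(left, right)                 # 0.6000000000000001 0.6
print(left == right)               # False
```

So "accumulation order matters" is the accurate version of the mantra: it is the grouping, not the operand order within a single operation, that changes the answer.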
johngossman 11 hours ago
Posting some code that reproduces the bug could help not only Apple but you and others.
watt an hour ago
Does it bother anyone else that the author drops "MiniMax" there in the article without bothering to explain or footnote what that is? (I could look it up, but I think article authors should call out these things).
docfort 4 hours ago
Interesting post, but the last bit of logic pointing to the Neural Engine for MLX doesn’t hold up. MLX supports running on CPU, Apple GPU via Metal, and NVIDIA GPU via CUDA: https://github.com/ml-explore/mlx/tree/main/mlx/backend
_kulang 11 hours ago
Maybe this is why my damn keyboard predictive text is so gloriously broken
- sen 11 hours ago |parent
  Oh it's not just me?
  Typing on my iPhone in the last few months (~6 months?) has been absolutely atrocious. I've tried disabling/enabling every combination of keyboard setting I can think of, but the predictive text just randomly breaks or it just gives up and stops correcting anything at all.
  - macintux 10 hours ago |parent
    I haven't watched the video, but clearly there's a broad problem with the iOS keyboard recently.
    https://news.ycombinator.com/item?id=46232528 ("iPhone Typos? It's Not Just You - The iOS Keyboard is Broken")
  - acdha 10 hours ago |parent
    It’s not just you, and it got bad on my work iPhone at the same time so I know it’s not failing hardware or some customization since I keep that quite vanilla.
- taneq 10 hours ago |parent
  It’s gotten so bad that I’m half convinced it’s either (a) deliberately trolling, or (b) ‘optimising’ for speech to text adoption.
swyx 3 hours ago
> Update on Feb. 1st:
> Well, now it's Feb. 1st and I have an iPhone 17 Pro Max to test with and... everything works as expected. So it's pretty safe to say that THAT specific instance of iPhone 16 Pro Max was hardware-defective.
nothing to see here.
mungoman2 5 hours ago
Good article. Would have liked to see them create a minimal test case, to conclusively show that the results of math operations are actually incorrect.
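A sketch of what such a minimal test case could look like, under stated assumptions. It is not the article's code and does not touch MLX or Metal: `matmul_f32` is a stand-in that merely simulates a lower-precision backend by rounding to binary32, the "broken" result is a simulated fault, and the tolerance is an arbitrary threshold. The idea is only to separate rounding-level differences from results that are actually wrong.

```python
import struct


def to_f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def matmul_f64(a, b):
    """Reference: naive matrix product in double precision."""
    out = []
    for row in a:
        out_row = []
        for j in range(len(b[0])):
            acc = 0.0
            for k, x in enumerate(row):
                acc += x * b[k][j]
            out_row.append(acc)
        out.append(out_row)
    return out


def matmul_f32(a, b):
    """Stand-in for a lower-precision backend: round after every operation."""
    out = []
    for row in a:
        out_row = []
        for j in range(len(b[0])):
            acc = 0.0
            for k, x in enumerate(row):
                acc = to_f32(acc + to_f32(to_f32(x) * to_f32(b[k][j])))
            out_row.append(acc)
        out.append(out_row)
    return out


def max_rel_error(reference, candidate):
    """Largest element-wise relative error (reference entries are nonzero here)."""
    return max(abs(r - c) / abs(r)
               for ref_row, cand_row in zip(reference, candidate)
               for r, c in zip(ref_row, cand_row))


A = [[0.1, 0.2], [0.3, 0.4]]
B = [[0.5, 0.6], [0.7, 0.8]]
TOLERANCE = 1e-5  # arbitrary threshold chosen for this sketch

reference = matmul_f64(A, B)
rounding_only = matmul_f32(A, B)
broken = [[10.0 * v for v in row] for row in rounding_only]  # simulated fault

print(max_rel_error(reference, rounding_only) < TOLERANCE)  # True
print(max_rel_error(reference, broken) < TOLERANCE)         # False
```

On a real device the candidate would come from the backend under test, and the same comparison would show whether the divergence is within rounding noise or far outside it.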
Metacelsus 9 hours ago
>"What is 2+2?" apparently "Applied.....*_dAK[...]" according to my iPhone
At least the machine didn't say it was seven!
- tolciho 7 hours ago |parent
  Maybe Trurl and Klapaucius were put in charge of Q&A.
thinkbud an hour ago
So the LLM is working as intended?
nickorlow 6 hours ago
I'd think other neural-engine using apps would also have weird behavior. Would've been interesting to try a few App Store apps and see the weird behavior
bri3d 12 hours ago
I love to see real debugging instead of conspiracy theories!
Did you file a radar? (silently laughing while writing this, but maybe there's someone left at Apple who reads those)
- djmips 4 hours ago |parent
  IKR - this is very typical
z3t4 3 hours ago
Neural nets, or AI, are very bad at math; they can only produce what's in the training data. So if you have trained one on 1+1 through 8+8, it can't do 9+9. It's not like a child's brain, which can draw logical conclusions.
ftyghome 8 hours ago
I also would like to see if the same error happens on another phone of the exact same model.
refulgentis 11 hours ago
.
- bri3d 11 hours ago |parent
  Can you read the article a little more closely?
  > - MiniMax can't fit on an iPhone.
  They asked MiniMax on their computer to make an iPhone app that didn't work.
  It didn't work using the Apple Intelligence API. So then:
  * They asked Minimax to use MLX instead. It didn't work.
  * They Googled and found a thread where Apple Intelligence also didn't work for other people, but only sometimes.
  * They HAND WROTE the MLX code. It didn't work. They isolated the step where the results diverged.
  > Better to dig in a bit more.
  The author already did 100% of the digging and then some.
  Look, I am usually an AI rage-enthusiast. But in this case the author did every single bit of homework I would expect and more, and still found a bug. They rewrote the test harness code without an LLM. I don't find the results surprising insofar as that I wouldn't expect MAC to converge across platforms, but the fact that Apple's own LLM doesn't work on their hardware and their own is an order of magnitude off is a reasonable bug report, in my book.
  - refulgentis 11 hours ago |parent
    Emptied out post, thanks for the insight!
    Fascinating the claim is Apple Intelligence doesn't work altogether. Quite a scandal.
    EDIT: If you wouldn't mind, could you edit out "AI rage enthusiast" you edited in? I understand it was in good humor, as you describe yourself that way as well. However, I don't want to eat downvotes on an empty comment that I immediately edited when you explained it wasn't minimax! People will assume I said something naughty :) I'm not sure it was possible to read rage into my comment.
    - LoganDark 11 hours ago |parent
      > Fascinating the claim is Apple Intelligence doesn't work altogether. Quite a scandal.
      No, the claim is their particular device has a hardware defect that causes MLX not to work (which includes Apple Intelligence).
      > EDIT: If you wouldn't mind, could you edit out "AI rage enthusiast" you edited in? I understand it was in good humor, as you describe yourself that way as well. However, I don't want to eat downvotes on an empty comment that I immediately edited when you explained! People will assume I said something naughty :) I'm not sure it was possible to read rage into my comment.
      Your comment originally read:
      > This is blinkered.
      > - MiniMax can't fit on an iPhone.
      > - There's no reason to expect models to share OOMs for output.
      > - It is likely this is a graceful failure mode for the model being far too large.
      > No fan of Apple's NIH syndrome, or it manifested as MLX.
      > I'm also no fan of "I told the robot [vibecoded] to hammer a banana into an apple. [do something impossible]. The result is inedible. Let me post to HN with the title 'My thousand dollars of fruits can't be food' [the result I have has ~nothing to do with the fruits]"
      > Better to dig in a bit more.
      Rather than erase it, and invite exactly the kind of misreading you don't want, you can leave it... honestly, transparently... with your admission in the replies below. And it won't be downvoted as much as when you're trying to manipulate / make requests of others to try to minimize your downvotes. Weird... voting... manipulating... stuff, like that, tends to be frowned upon on HN.
      You have more HN karma than I do, even, so why care so much about downvotes...
      If you really want to disown something you consider a terrible mistake, you can email the HN mods to ask for the comment to be dissociated from your account. Then future downvotes won't affect your karma. I did this once.
      - fragmede 9 hours ago |parent
        Oh no, all my meaningless internet points, gone!
      - mikestew 9 hours ago |parent
        > Then future downvotes won't affect your karma.
        Who cares? The max amount of karma loss is 4 points, we can afford to eat our downvotes like adults.
        LoganDark 8 hours ago |parent
        Huh. I thought the minimum comment score was -4 (which would make the maximum amount of karma loss 5, since each comment starts at 1 point), but I didn't know if that was a cap on karma loss or just a cap on comment score.
tehwebguy 11 hours ago
Here’s one that kills me:
- Tightening some bolts, listening to something via airpods
- Spec tells me torque in Nm
- Torque wrench is in ft lbs
- “Hey Siri, what’s X newton meters in foot pounds?”
- “Here’s some fucking website: ”
dav43 9 hours ago
My thousand dollar iPhone can't even add a contact from a business card.
vanviegen 12 hours ago
Perfect conclusion: my expensive and rather new phone is broken by design, so I just buy an even newer and more expensive one from the same vendor.
The heroic attempt at debugging this though makes me sympathize with all of those engineers that must be doing low-level LLM development these days and getting just noise out of their black boxes.
- ohyoutravel 12 hours ago |parent
  This is a vibe coded slop app.