Tanstack Start | Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning

Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning(techcommunity.microsoft.com)

439 points by lappa 7 months ago | 147 comments

simonw 7 months ago
The most interesting thing about this is the way it was trained using synthetic data, which is described in quite a bit of detail in the technical report: https://arxiv.org/abs/2412.08905
Microsoft haven't officially released the weights yet but there are unofficial GGUFs up on Hugging Face already. I tried this one: https://huggingface.co/matteogeniaccio/phi-4/tree/main
I got it working with my LLM tool like this:
```
  llm install llm-gguf
  llm gguf download-model https://huggingface.co/matteogeniaccio/phi-4/resolve/main/phi-4-Q4_K_M.gguf
  llm chat -m gguf/phi-4-Q4_K_M
```
Here are some initial transcripts: https://gist.github.com/simonw/0235fd9f8c7809d0ae078495dd630...
More of my notes on Phi-4 here: https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
- tkellogg 7 months ago |parent
  I added Phi-4 to my reasoning model collection because it seems to exhibit reasoning behavior, it stopped to consider alternatives before concluding. I assume this is related to their choice in training data:
  > Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to the problems in a step-by-step manner.
  https://github.com/tkellogg/lrm-reasoning/blob/main/phi4.md
- mirekrusin 7 months ago |parent
  This "draw pelican riding on bicycle" is quite deep if you think about it.
  Phi is all about synthetic training and prompt -> svg -> render -> evaluate image -> feedback loop feels like ideal fit for synthetic learning.
  You can push it quite far with stuff like basic 2d physics etc with plotting scene after N seconds or optics/rays, magnetic force etc.
  SVG as LLM window to physical world.
  - dartos 7 months ago |parent
    > SVG as LLM window to physical world.
    What? let’s try not to go full forehead into hype.
    SVGs would be an awfully poor analogy for the physical world…
    - ben_w 7 months ago |parent
      SVGs themselves are just an image format; but because of their vector nature, they could easily be mapped onto values from a simulation in a physics engine — at least, in the game physics sense of the word, rods and springs etc., as a fluid simulation is clearly a better map to raster formats.
      If that physics engine were itself a good model for the real world, then you could do simulated evolution to get an end result that is at least as functional as a bike (though perhaps it wouldn't look like a traditional bike) even if the only values available to the LLM were the gross characteristics like overall dimensions and mass.
      But I'd say the chance of getting a pelican SVG out of a model like this is mostly related to lots of text describing the anatomy of pelicans, and it would not gain anything from synthetic data.
      - dartos 7 months ago |parent
        > but because of their vector nature, they could easily be mapped onto values from a simulation in a physics engine.
        I don’t think the fact that the images are described with vectors magically makes it better for representing physics than any other image representation. Maybe less so, since there will be so much textual information not related to the physical properties of the object.
        What about them makes it easier to map to physics than an AABB?
        For soft body physics, im pretty sure a simpler sort of distance field representation would even be better. (I’m not as familiar with soft body as rigid body)
        ben_w 7 months ago |parent
        For rendering them, more than for anything else. There's a convenient 1-to-1 mapping in both directions.
        You can of course just rasterise the vector for output, it's not like people view these things on oscilloscopes.
    - perrygeo 7 months ago |parent
      SVGs are just 2D geometries, and I can assure you that almost every GIS project in the world uses vector data to model the physical world.
      Whether it's a good model or a model that the LLMs can manipulate, I have no idea. But it's entirely plausible that I could feed an SVG map and ask it geospatial reasoning questions like how far between A and B, what's the nearest grocery store, etc.
- syntaxing 7 months ago |parent
  Wow, those responses are better than I expected. Part of me was expecting terrible responses since Phi-3 was amazing on paper too but terrible in practice.
  - refulgentis 7 months ago |parent
    One of the funniest tech subplots in recent memory.
    TL;DR it was nigh-impossible to get it to emit the proper "end of message" token. (IMHO the chat training was too rushed). So all the local LLM apps tried silently hacking around it. The funny thing to me was no one would say it out loud. Field isn't very consumer friendly, yet.
    - TeMPOraL 7 months ago |parent
      Speaking of, I wonder if and how many of the existing frontends, interfaces and support packages that generalize over multiple LLMs, and include Anthropic, actually know how to prompt it correctly. Seems like most developers missed the memo on https://docs.anthropic.com/en/docs/build-with-claude/prompt-..., and I regularly end up in situation in which I wish they gave more minute control on how the request is assembled (proprietary), and/or am considering gutting the app/library myself (OSS; looking at you, Aider), just to have file uploads, or tools, or whatever other smarts the app/library does, encoded in a way that uses Claude to its full potential.
      I sometimes wonder how many other model or vendor-specific improvements there are, that are missed by third-party tools despite being well-documented by the vendors.
      - refulgentis 7 months ago |parent
        Hah, good call out: there was such a backlash and quick turnaround on Claude requiring XML tool calls, I think people just sort of forgot about it altogether.
        You might be interested in Telosnex, been working on it for ~year and it's in good shape and is more or less designed for this sort of flexibility / allowing user input into requests. Pick any* provider, write up your own canned scripts, with incremental complexity: ex. your average user would just perceive it as "that AI app with the little picker for search vs. chat vs. art"
        * OpenAI, Claude, Mistral, Groq Llama 3.x, and one I'm forgetting....Google! And .gguf
    - regularfry 7 months ago |parent
      In a field like this the self-doubt of "surely it wouldn't be this broken, I must just be holding it wrong" is strong.
- algo_trader 7 months ago |parent
  > More of my notes on Phi-4 here: https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
  Nice. Thanks.
  Do you think sampling the stack traces of millions of machines is a good dataset for improving code performance? Maybe sample android/jvm bytecode.
  Maybe a sort of novelty sampling to avoid re-sampling hot-path?
- lifeisgood99 7 months ago |parent
  The SVG created for the first prompt is valid but is a garbage image.
  - bentcorner 7 months ago |parent
    In general I've had poor results with LLMs generating pictures using text instructions (in my case I've tried to get them to generate pictures using plots in KQL). They work but the pictures are very very basic.
    I'd be interested for any LLM emitting any kind of text-to-picture instructions to get results that are beyond a kindergartner-cardboard-cutout levels of art.
    - pizza 7 months ago |parent
      I do with Claude: https://news.ycombinator.com/item?id=42351796#42355665
    - mycall 7 months ago |parent
      I've had success with LLMs producing mermaid.js or three.js output but that is a different use case.
    - simonw 7 months ago |parent
      That's why I use the SVG pelican riding a bicycle thing as a benchmark: it's a deliberately absurd and extremely difficult task.
      - accrual 7 months ago |parent
        Appreciate your rapid analysis of new models, Simon. Have any models you've tested performed well on the pelican SVG task?
        simonw 7 months ago |parent
        gemini-exp-1206 is my new favorite: https://simonwillison.net/2024/Dec/6/gemini-exp-1206/
        Claude 3.5 Sonnet is in second place: https://github.com/simonw/pelican-bicycle?tab=readme-ov-file...
        carbocation 7 months ago |parent
        The gemini result is great. I modified your prompt to encourage more detail ("Generate an SVG of a pelican riding a bicycle. The degree of detail should be surprisingly high and should spark delight for the viewer.")
        This is what o1-pro yielded: https://gist.github.com/carbocation/8d780ad4c3312693ca9a43c6...
        accrual 7 months ago |parent
        The Gemini result is quite impressive, thanks for sharing these!
        codedokode 7 months ago |parent
        They probably trained it for this specific task (generating SVG images), right?
        simonw 7 months ago |parent
        I'm hoping that nobody has deliberately trained on SVG images of pelicans riding bicycles yet.
      - Teever 7 months ago |parent
        I'm really glad that I see someone else doing something similar. I had the epiphany a while ago that if LLMs can interpret textual instructions to draw a picture and output the design in another textual format that this a strong indicator that they're more than just stochastic parrots.
        My personal test has been "A horse eating apples next to a tree" but the deliberate absurdity of your example is a much more useful test.
        Do you know if this is a recognized technique that people use to study LLMs?
        simonw 7 months ago |parent
        I've seen people using "draw a unicorn using tikz" https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-exploratio...
        int_19h 7 months ago |parent
        I did some experiments of my own after this paper, but letting GPT-4 run wild, picking its own scene. It wanted to draw a boat on a lake, and I also asked it to throw in some JS animations, so it made the sun set:
        https://int19h.org/chatgpt/lakeside/index.html
        One interesting thing that I found out while doing this is that if you ask GPT-4 to produce SVG suitable for use in HTML, it will often just generate base64-encoded data: URIs directly. Which do contain valid SVG inside as requested.
        girvo 7 months ago |parent
        That came, IIRC, from one of the OpenAI or Microsoft people (Sebastian Bubeck); it was recounted in an NPR podcast "Greetings from Earth"
        https://www.thisamericanlife.org/803/transcript
        krackers 7 months ago |parent
        It's in this presentation https://www.youtube.com/watch?v=qbIk7-JPB2c
        The most significant part I took away is that when safety "alignment" was done the ability plummeted. So that really makes me wonder how much better these models would be if they weren't lobotomized to prevent them from saying bad words.
        MyFirstSass 7 months ago |parent
        But how will that prove that it's more than a stochastic parrot, honestly curious?
        Isn't it just like any kind of conversion or translation? Ie. a relationship mapping between diffrent domains and just as much parroting "known" paths between parts of different domains?
        If "sun" is associated with "round", "up high", "yellow","heat" in english that will map to those things in SVG or in whatever bizarre format you throw at with relatively isomorphic paths existing there just knitted together as a different metamorphosis or cluster of nodes.
        On a tangent it's interesting what constitutes the heaviest nodes in the data, how shared is "yellow" or "up high" between different domains, and what is above and below them hierarchically weight-wise. Is there a heaviest "thing in the entire dataset"?
        If you dump a heatmap of a description of the sun and an SVG of a sun - of the neuron / axon like cloud of data in some model - would it look similar in some way?
        sabbaticaldev 7 months ago |parent
        that’s a huge stretch for parroting
        memhole 7 months ago |parent
        Not sure if this counts. I recently went from description of a screenshot of graph to generate pandas code and plot from description. Conceptually it was accurate.
        I don’t think it reflects any understanding. But to go from screenshot to conceptually accurate and working code was impressive.
  - chen_dev 7 months ago |parent
    Amazon Nova models:
    https://gist.github.com/uschen/38fc65fa7e43f5765a584c6cd24e1...
  - simonw 7 months ago |parent
    Yeah, it didn't do very well on that one. The best I've had from a local model there was from QwQ: https://simonwillison.net/2024/Nov/27/qwq/
  - refulgentis 7 months ago |parent
    For context, pelican riding a bicycle: https://imgur.com/a/2nhm0XM
    Copied SVG from gist into figma, added dark gray #444444 background, exported as PNG 1x.
- fisherjeff 7 months ago |parent
  Looks like someone’s finally caught up with The Hallmark Channel’s LLM performance
- selcuka 7 months ago |parent
  I tried the same GGUF on LMStudio and it seems to sometimes leak its training data.
```
    Prompt:
    Are you based on OpenAI?

    Reply:
    No, I am not based on OpenAI's models; I am developed by Microsoft. While both systems are designed for natural language processing and understanding, we may have different architectures, training methodologies, or datasets. However, our shared goal is to provide helpful and informative responses while ensuring user safety and privacy.<|user|>
    Are you similar to ChatGPT?<|end|>
```
- vergessenmir 7 months ago |parent
  When working with GGUF what chat templates do you use? Pretty much every gguf I've imported into ollama has given me garbage response. Converting the tokenizer json has yielded mixed results.
  For example how do you handle the phi-4 models gguf chat template?
  - simonw 7 months ago |parent
    I use whatever what template is baked into the GGUF file.
    You can click on the little info icon on Hugging Face to see that directly.
    For https://huggingface.co/matteogeniaccio/phi-4/tree/main?show_... that's this:
```
  {% for message in messages %}{% if
  (message['role'] == 'system')
  %}{{'<|im_start|>system<|im_sep|>' +
  message['content'] + '<|im_end|>'}}{%
  elif (message['role'] == 'user')
  %}{{'<|im_start|>user<|im_sep|>' +
  message['content'] +
  '<|im_end|><|im_start|>assistant<|im_sep|>'}}{%
  elif (message['role'] == 'assistant')
  %}{{message['content'] + '<|im_end|>'}}{%
  endif %}{% endfor %}
```
- belter 7 months ago |parent
  > it was trained using synthetic data
  Is this not supposed to cause Model collapse?
  - rhdunn 7 months ago |parent
    It depends on how you construct the synthetic data and how the model is trained on that data.
    For diffusion-based image generators training only on synthetic data over repeated model training can cause model collapse as errors in the output can amplify in the trained model. It's usually the 2nd or 3rd model created this way (with output of the previous used as input for the first) for it to collapse.
    It was found that using primary data along side synthetic data avoided the model collapse. Likewise, if you also have some sort of human scoring/evaluation you can help avoid artefacts.
  - simonw 7 months ago |parent
    This is why I don't think model collapse actually matters: people have been deliberately training LLMs on synthetic data for over a year at this point.
    As far as I can tell model collapse happens when you deliberately train LLMs on low quality LLM-generated data so that you can write a paper about it.
  - ziofill 7 months ago |parent
    I may have misunderstood, but I think that it depends a lot on the existence of a validation mechanism. Programming languages have interpreters and compilers that can provide a useful signal, while for images and natural language there isn’t such an automated mechanism, or at least its not that straightforward.
  - nxobject 7 months ago |parent
    As someone who's a completely layman: I wonder if the results of model collapse are no worse than, say, sufficiently complex symbolic AI (modulo consistency and fidelity?)
  - fulafel 7 months ago |parent
    No.
    - belter 7 months ago |parent
      Is this paper wrong? - https://arxiv.org/abs/2311.09807
      - simonw 7 months ago |parent
        It shows that if you deliberately train LLMs against their own output in a loop you get problems. That's not what synthetic data training does.
        belter 7 months ago |parent
        I understand and appreciate your clarification. However would it not be the case some synthetic data strategies, if misapplied, can resemble the feedback loop scenario and thus risk model collapse?
- mhh__7 months ago |parent
  Along those lines (synthetic data) I would keep an eye on the chinese labs given that they are probably quite data and compute constrained, in English at least.
- Havoc 7 months ago |parent
  >Microsoft haven't officially released the weights
  Thought it was official just not on huggingface but rather whatever azure competitor thing they're pushing?
  - simonw 7 months ago |parent
    I found their AI Foundry thing so hard to figure out I couldn't tell if they had released weights (as opposed to a way of running it via an API).
    Since there are GGUFs now so someone must have released some weights somewhere.
    - lhl 7 months ago |parent
      The safetensors are in the phi-4 folder of the very repo you linked in your OP.
    - Havoc 7 months ago |parent
      Yeah the weights were on there apparently.
      Planned week delay between release on their own platform and hf
      But much like you I decided I can be patient / use the ggufs
- patrick0d 7 months ago |parent
  this vibe check is more insightful to me than the popular evals. nice job!
thot_experiment 7 months ago
For prompt adherence it still fails on tasks that Gemma2 27b nails every time. I haven't been impressed with any of the Phi family of models. The large context is very nice, though Gemma2 plays very well with self-extend.
- impossiblefork 7 months ago |parent
  It's a much smaller model though.
  I think the point is more the demonstration that such a small model can have such good performance than any actual usefulness.
  - magicalhippo 7 months ago |parent
    Gemma2 9B has significantly better prompt adherence than Llama 3.1 8B in my experience.
    I've just assumed it's down to how it was trained, but no expert.
- jacoblambda 7 months ago |parent
  Yeah they mention this in the weaknesses section.
  > While phi-4 demonstrates relatively strong performance in answering questions and performing reasoning tasks, it is less proficient at rigorously following detailed instructions, particularly those involving specific formatting requirements.
  - thot_experiment 7 months ago |parent
    Ah good catch, I am forever cursed in my preference for snake over camel.
xeckr 7 months ago
Looks like it punches way above its weight(s).
How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro?
- anon373839 7 months ago |parent
  We’re already past that point! MacBooks can easily run models exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 8B, or Gemma 2 9B. These models run at very comfortable speeds on Apple Silicon. And they are distinctly more capable and less prone to hallucination than GPT-3.5 was.
  Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to GPT-4, and they will run on MacBook Pros with at least 64GB of RAM. However, I have an M3 Max and I can’t say that models of this size run at comfortable speeds. They’re a bit sluggish.
  - noman-land 7 months ago |parent
    The coolness of local LLMs is THE only reason I am sadly eyeing upgrading from M1 64GB to M4/5 128+GB.
    - Terretta 7 months ago |parent
      Compare performance on various Macs here as it gets updated:
      https://github.com/ggerganov/llama.cpp/discussions/4167
      OMM, Llama 3.3 70B runs at ~7 text generation tokens per second on Macbook Pro Max 128GB, while generating GPT-4 feeling text with more in depth responses and fewer bullets. Llama 3.3 70B also doesn't fight the system prompt, it leans in.
      Consider e.g. LM Studio (0.3.5 or newer) for a Metal (MLX) centered UI, include MLX in your search term when downloading models.
      Also, do not scrimp on the storage. At 60GB - 100GB per model, it takes a day of experimentation to use 2.5TB of storage in your model cache. And remember to exclude that path from your TimeMachine backups.
      - noman-land 7 months ago |parent
        Thank you for all the tips! I'd probably go 128GB 8TB because of masochism. Curious, what makes so many of the M4s in the red currently.
        vessenes 7 months ago |parent
        It's all memory bandwidth related -- what's slow is loading these models into memory, basically. The last die from Apple with all the channels was the M2 Ultra, and I bet that's what tops those leader boards. M4 has not had a Max or an Ultra release yet; when it does (and it seems likely it will), those will be the ones to get.
      - ant6n 7 months ago |parent
        What if you have a Macbook Air with 16GB (the bechmarks dont seem to show memory).
        simonw 7 months ago |parent
        You could definitely run an 8B model on that, and some of those are getting very capable now.
        The problem is that often you can't run anything else. I've had trouble running larger models in 64GB when I've had a bunch of Firefox and VS Code tabs open at the same time.
        xdavidliu 7 months ago |parent
        I thought VSCode was supposed to be lightweight, though I suppose with extensions it can add up
        evilduck 7 months ago |parent
        8B models with larger contexts, or even 9-14B parameter models quantized.
        Qwen2.5 Coder 14B at a 4 bit quantization could run but you will need to be diligent about what else you have in memory at the same time.
        chris_st 7 months ago |parent
        I have a M2 Air with 24GB, and have successfully run some 12B models such as mistral-nemo. Had other stuff going as well, but it's best to give it as much of the machine as possible.
        gcanyon 7 months ago |parent
        I recently upgraded to exactly this machine for exactly this reason, but I haven't taken the leap and installed anything yet. What's your favorite model to run on it?
    - stkdump 7 months ago |parent
      I bought an old used desktop computer, a used 3090, and upgraded the power supply, all for around 900€. Didn't assemble it all yet. But it will be able to comfortably run 30B parameter models with 30-40 T/s. The M4 Max can do ~10 T/s, which is not great once you really want to rely on it for your productivity.
      Yes, it is not "local" as I will have to use the internet when not at home. But it will also not drain the battery very quickly when using it, which I suspect would happen to a Macbook Pro running such models. Also 70B models are out of reach of my setup, but I think they are painfully slow on Mac hardware.
    - jazzyjackson 7 months ago |parent
      I'm returning my 96GB m2 max. It can run unquantized llama 3.3 70B but tokens per second is slow as molasses and still I couldn't find any use for it, just kept going back to perplexity when I actually needed to find an answer to something.
      - Tepix 7 months ago |parent
        Interesting. You're using the FP8 version i'm guessing? How many tokens/s are you using and which software? MLX?
    - alecco 7 months ago |parent
      I'm waiting for next gen hardware. All the companies are aiming for AI acceleration.
    - kleiba 7 months ago |parent
      Sorry, I'm not up to date, but can you run GPTs locally or only vanilla LLMs?
  - kgeist 7 months ago |parent
    >MacBooks can easily run models exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 8B, or Gemma 2 9B.
    If only those models supported anything other than English
    - simonw 7 months ago |parent
      Llama 3.1 8B advertises itself as multilingual.
      All of the Qwen models are basically fluent in both English and Chinese.
      - kgeist 7 months ago |parent
        Llama 8B is multilingual on paper, but the quality is very bad compared to English. It generally understands grammar, and you can understand what it's trying to say, but the choice of words is very off most of the time, often complete gibberish. If you can imagine the output of an undertrained model, this is it. Meanwhile GPT3.5 had far better output that you could use in production.
    - barrell 7 months ago |parent
      Cohere just announced Command R7B. I haven’t tried it yet but their larger models are the best multilingual models I’ve used
    - numpad0 7 months ago |parent
      Is subtext to this uncensored Chinese support?
  - 7 months ago |parent
    [deleted]
  - noodletheworld 7 months ago |parent
    [dead]
    - anon373839 7 months ago |parent
      > gpt-3.5-turbo is generally considered to be about 20B params. An 8B model does not exceed it.
      The industry has moved on from the old Chinchilla scaling regime, and with it the conviction that LLM capability is mainly dictated by parameter count. OpenAI didn't disclose how much pretraining they did for 3.5-Turbo, but GPT 3 was trained on 300 billion tokens of text data. In contrast, Llama 3.1 was trained on 15 trillion tokens of data.
      Objectively, Llama 3.1 8B and other small models have exceeded GPT-3.5-Turbo in benchmarks and human preference scores.
      > Is a $8000 MBP regular consumer hardware?
      As user `bloomingkales` notes down below, a $499 Mac Mini can run 8B parameter models. An $8,000 expenditure is not required.
    - PhilippGille 7 months ago |parent
      >> Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to GPT-4
      > I'm skeptical; the llama 3.1 405B model is the only comparable model I've used, and it's significantly larger than the 70B models you can run locally.
      Every new Llama generation achieved to beat larger models of the previous generation with smaller ones.
      Check Kagi's LLM benchmark: https://help.kagi.com/kagi/ai/llm-benchmark.html
      Check the HN thread around the 3.3 70b release: https://news.ycombinator.com/item?id=42341388
      And their own benchmark results in their model card: https://github.com/meta-llama/llama-models/blob/main/models%...
      Groq's post about it: https://groq.com/a-new-scaling-paradigm-metas-llama-3-3-70b-...
      Etc
      - int_19h 7 months ago |parent
        They still do not beat GPT-4, however.
        And benchmarks are very misleading in this regard. We've seen no shortage of even 8B models claiming that they beat GPT-4 and Claude in benchmarks. Every time this happens, once you start actually using the model, it's clear that it's not actually on par.
        simonw 7 months ago |parent
        GPT-4 from March 2023, not GPT-4o from May 2024.
    - runako 7 months ago |parent
      > Is a $8000 MBP regular consumer hardware?
      May want to double-check your specs. 16" w/128GB & 2TB is $5,400.
    - zozbot234 7 months ago |parent
      > Is a $8000 MBP regular consumer hardware? If you don't think so, then the answer is probably no.
      The very first Apple McIntosh was not far from that price at its release. Adjusted for inflation of course.
    - tosh 7 months ago |parent
      A Mac with 16GB RAM can run qwen 7b, gemma 9b and similar models that are somewhere between GPT3.5 and GPT4.
      Quite impressive.
      - jazzyjackson 7 months ago |parent
        on what metric?
        Why would OpenAI bother serving GPT4 if customers would be just as happy with a tiny 9B model?
        tosh 7 months ago |parent
        https://lmarena.ai/
        Check out the lmsys leaderboard. It has an overall ranking as well as ranking for specific categories.
        OpenAI are also serving gpt4o mini. That said afaiu it’s not known how large/small mini is.
        Being more useful than GPT3.5 is not a high bar anymore.
        simonw 7 months ago |parent
        Don't confuse GPT-4 and GPT-4o.
        GPT-4o is a much better experience than the smaller local models. You can see that in the lmarena benchmarks or from trying them out yourself.
- bloomingkales 7 months ago |parent
  M4 Mac mini 16gb for $500. It's literally an inferencing block (small too, fits in my palm). I feel like the whole world needs one.
  - alganet 7 months ago |parent
    > inferencing block
    Did you mean _external gpu_?
    Choose any 12GB or more video card with GDDR6 or superior and you'll have at least double the performance of a base m4 mini.
    The base model is almost an older generation. Thunderbolt 4 instead of 5, slower bandwidths, slower SSDs.
    - kgwgk 7 months ago |parent
      > you'll have at least double the performance of a base m4 mini
      For $500 all included?
      - alganet 7 months ago |parent
        The base mini is 599.
        Here's a config for around the same price. All brand new parts for 573. You can spend the difference improving any part you wish, or maybe get an used 3060 and go AM5 instead (Ryzen 8400F). Both paths are upgradeable.
        https://pcpartpicker.com/list/ftK8rM
        Double the LLM performance. Half the desktop performance. But you can use both at the same time. Your computer will not slow down when running inference.
        bloomingkales 7 months ago |parent
        That’s a really nice build.
        alganet 7 months ago |parent
        Another possible build is to use a mini-pc and M.2 connections
        You'll need a mini-pc with two M.2 slots, like this:
        https://www.amazon.com/Beelink-SER7-7840HS-Computer-Display/...
        And a riser like this:
        https://www.amazon.com/CERRXIAN-Graphics-Left-PCI-Express-Ex...
        And some courage to open it and rig the stuff in.
        Then you can plug a GPU on it. It should have decent load times. Better than an eGPU, worse than the AM4 desktop build, fast enough to beat the M4 (once the data is in the GPU, it doesn't matter).
        It makes for a very portable setup. I haven't built it, but I think it's a reasonable LLM choice comparable to the M4 in speed and portability while still being upgradable.
        Edit: and you'll need an external power supply of at least 400W:)
- lappa 7 months ago |parent
  It's easy to argue that Llama-3.3 8B performs better than GPT-3.5. Compare their benchmarks, and try the two side-by-side.
  Phi-4 is yet another step towards a small, open, GPT-4 level model. I think we're getting quite close.
  Check the benchmarks comparing to GPT-4o on the first page of their technical report if you haven't already https://arxiv.org/pdf/2412.08905
  - vulcanash999 7 months ago |parent
    Did you mean Llama-3.1 8B? Llama 3.3 currently only has a 70B model as far as I’m aware.
- 7 months ago |parent
  [deleted]
- ActorNightly 7 months ago |parent
  Why would you want to though? You already can get free access to large LLMs and nobody is doing anything groundbreaking with them.
  - jckahn 7 months ago |parent
    I only use local, open source LLMs because I don’t trust cloud-based LLM hosts with my data. I also don’t want to build a dependence on proprietary technology.
- refulgentis 7 months ago |parent
  We're there, Llama 3.1 8B beats Gemini Advanced for $20/month. Telosnex with llama 3.1 8b GGUF from bartowski. https://telosnex.com/compare/ (How!? tl;dr: I assume Google is sandbagging and hasn't updated the underlying Gemini)
- simonw 7 months ago |parent
  We're there. Llama 3.3 70B is GPT-4 level and runs on my 64GB MacBook Pro: https://simonwillison.net/2024/Dec/9/llama-33-70b/
  The Qwen2 models that run on my MacBook Pro are GPT-4 level too.
  - BoorishBears 7 months ago |parent
    Saying these models are at GPT-4 level is setting anyone who doesn't place special value on the local aspect up for disappointment.
    Some people do place value on running locally, and I'm not against then for it, but realistically no 70B class model has the amount of general knowledge or understanding of nuance as any recent GPT-4 checkpoint.
    That being said these models are still very strong compared to what we had a year ago and capable of useful work
    - simonw 7 months ago |parent
      I said GPT-4, not GPT-4o. I'm talking about a model that feels equivalent to the GPT-4 we were using in March of 2023.
      - int_19h 7 months ago |parent
        I remember using GPT-4 when it first dropped to get a feeling of its capabilities, and no, I wouldn't say that llama-3.3-70b is comparable.
        At the end of the day, there's only so much you can cram into any given number of parameters, regardless of what any artificial benchmark says.
        simonw 7 months ago |parent
        I envy your memory.
      - BoorishBears 7 months ago |parent
        You're free to intentionally miss their point, does them no good.
  - n144q 7 months ago |parent
    I wouldn't call 64GB MacBook Pro "regular consumer hardware".
    - russellbeattie 7 months ago |parent
      I have to disagree. I understand it's very expensive, but it's still a consumer product available to anyone with a credit card.
      The comparison is between something you can buy off the shelf like a powerful Mac, vs something powered by a Grace Hopper CPU from Nvidia, which would require both lots of money and a business relationship.
      Honestly, people pay $4k for nice TVs, refrigerators and even couches, and those are not professional tools by any stretch. If LLMs needed a $50k Mac Pro with maxed out everything, that might be different. But anything that's a laptop is definitely regular consumer hardware.
      - PhunkyPhil 7 months ago |parent
        There's definitely been plenty sources of hardware capable of running LLMs out there for a while, Mac or not. A couple 4090s or P40s will run 3.1 70b. Or, since price isn't a limit, there are other easier & more powerful options like a [tinybox](https://tinygrad.org/#tinybox:~:text=won%27t%20be%20consider...).
    - 7 months ago |parent
      [deleted]
    - jsheard 7 months ago |parent
      Yeah, a computer which starts at $3900 is really stretching that classification. Plus if you're that serious about local LLMs then you'd probably want the even bigger RAM option, which adds another $800...
      - evilduck 7 months ago |parent
        An optioned up minivan is also expensive but doesn’t cost as much as a firetruck. It’s expensive but still very much consumer hardware. A 3x4090 rig is more expensive and still consumer hardware. An H100 is not, you can buy like 7 of these optioned up MBP for a single H100.
        michaelt 7 months ago |parent
        In my experience, people use the term in two separate ways.
        If I'm running a software business selling software that runs on 'consumer hardware' the more people can run my software, the more people can pay me. For me, the term means the hardware used by a typical-ish consumer. I'll check the Steam hardware survey, find the 75th-percentile gamer has 8 cores, 32GB RAM, 12GB VRAM - and I'd better make sure my software works on a machine like that.
        On the other hand, 'consumer hardware' could also be used to simply mean hardware available off-the-shelf from retailers who sell to consumers. By this definition, 128GB of RAM is 'consumer hardware' even if it only counts as 0.5% in Steam's hardware survey.
        evilduck 7 months ago |parent
        On the Steam Hardware Survey the average gamer uses a computer with a 1080p display too. That doesn't somehow make any gaming laptop with a 2k screen sold in the last half decade a non-consumer product. For that matter the average gaming PC on Steam is even above average relative to the average computer. The typical office computer or school Chromebook is likely several generations older doesn't have an NPU or discrete GPU at all.
        For AI and LLMs, I'm not aware of any company even selling the models assets directly to consumers, they're either completely unavailable (OpenAI) or freely licensed so the companies training them aren't really dependent what the average person has for commercial success.
      - criddell 7 months ago |parent
        In the early 80's, people were spending more than $3k for an IBM 5150. For that price you got 64 kB of RAM, a floppy drive, and monochrome monitor.
        Today, lots of people spend far more than that for gaming PCs. An Alienware R16 (unquestionably a consumer PC) with 64 GB of RAM starts at $4700.
        It is an expensive computer, but the best mainstream computers at any particular time have always cost between $2500 and $5000.
excerionsforte 7 months ago
Looks like someone converted it for Ollama use already: https://ollama.com/vanilj/Phi-4
- accrual 7 months ago |parent
  I've had great success with quantized Phi-4 12B and Ollama so far. It's as fast as Llama 3.1 8B but the results have been (subjectively) higher quality. I copy/pasted some past requests into Phi-4 and found the answers were generally better.
jsight 7 months ago
I really like the ~3B param version of phi-3. It wasn't very powerful and overused memory, but was surprisingly strong for such a small model.
I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than llama 3 or Mistral at this point. Maybe that will be wrong, but I don't have high hopes.
travisgriggs 7 months ago
Where have I been? What is a “small” language model? Wikipedia just talks about LLMs. Is this a sort of spectrum? Are there medium language models? Or is it a more nuanced classifier?
- hagen_dogs 7 months ago |parent
  I think it came from this paper, TinyStories (https://arxiv.org/abs/2305.07759). iirc this was also the inspiration for the Phi family of models. The essential point (of the TinyStories paper), "if we train a model on text meant for 3-4 year olds, since that's much simpler shouldn't we need fewer parameters?" Which is correct. In the original they have a model that's 32 Million parameters and they compare it GPT-2 (1.5 Billion parameters) and the 32M model does much better. Microsoft has been interesed in this because "lower models == less resource usage" which means they can run on consumer devices. You can easily run TinyStories from your phone, which is presumably what Microsoft wants to do too.
- dboreham 7 months ago |parent
  There are all sizes of models from a few GB to hundreds of GB. Small presumably means small enough to run on end-user hardware.
- narag 7 months ago |parent
  7B vs 70B parameters... I think. The small ones fit in the memory of consumer grade cards. That's what I more or less know (waiting for my new computer to arrive this week)
  - agnishom 7 months ago |parent
    How many parameters did ChatGPT have in Dec 2022 when it first broke into mainstream news?
    - reissbaker 7 months ago |parent
      GPT-3 had 175B, and the original ChatGPT was probably just a GPT-3 finetune (although they called it gpt-3.5, so it could have been different). However, it was severely undertrained. Llama-3.1-8B is better in most ways than the original ChatGPT; a well-trained ~70B usually feels GPT-4-level. The latest Llama release, llama-3.3-70b, goes toe-to-toe even with much larger models (albeit is bad at coding, like all Llama models so far; it's not inherent to the size, since Qwen is good, so I'm hoping the Llama 4 series is trained on more coding tokens).
      - swyx 7 months ago |parent
        > However, it was severely undertrained
        by modern standards. at the time, it was trained according to neural scaling laws oai believed to hold.
        reissbaker 7 months ago |parent
        Sure, at the time everyone misunderstood Chinchilla. Nonetheless it was severely undertrained, even if they didn't know it back then.
    - simonw 7 months ago |parent
      I don't think that's ever been shared, but it's predecessor GPT-3 Da Vinci was 175B.
      One of the most exciting trends of the past year has been models getting dramatically smaller while maintaining similar levels of capability.
- tbrownaw 7 months ago |parent
  It's a marketing term for the idea that quality over quantity in training data will lead to smaller models that work as well as larger models.
mupuff1234 7 months ago
So we moved from "reasoning" to "complex reasoning".
I wonder what will be next month's buzzphrase.
- TeMPOraL 7 months ago |parent
  > So we moved from "reasoning" to "complex reasoning".
  Only from the perspective of those still complaining about the use of the term "reasoning", who now find themselves left behind as the world has moved on.
  For everyone else, the phrasing change perfectly fits the technological change.
  - HarHarVeryFunny 7 months ago |parent
    Reasoning basically means multi-step prediction, but to be general the reasoner also needs to be able to:
    1) Realize when it's reached an impasse, then backtrack and explore alternatives
    2) Recognize when no further progress towards the goal appears possible, and switch from exploiting existing knowledge to exploring/acquiring new knowledge to attempt to proceed. An LLM has limited agency, but could for example ask a question or do a web search.
    In either case, prediction failure needs to be treated as a learning signal so the same mistake isn't repeated, and when new knowledge is acquired that needs to be remembered. In both cases this learning would need to persist beyond the current context in order to be something that the LLM can build on in the future - e.g. to acquire a job skill that may take a lot of experience/experimentation to master.
    It doesn't matter what you call it (basic or advanced), but it seems that current attempts at adding reasoning to LLMs (e.g. GPT-o1) are based around 1), a search-like strategy, and learning is in-context and ephemeral. General animal-like reasoning needs to also support 2) - resolving impasses by targeted new knowledge acquisition (and/or just curiosity-driven experimentation), as well as continual learning.
- criddell 7 months ago |parent
  If you graded humanity on their reasoning ability, I wonder where these models would score?
  I think once they get to about the 85th percentile, we could upgrade the phrase to advanced reasoning. I'm roughly equating it with the percentage of the US population with at least a master's degree.
  - chairhairair 7 months ago |parent
    All current LLMs openly make simple mistakes that are completely incompatible with true "reasoning" (in the sense any human would have used that term years ago).
    I feel like I'm taking crazy pills sometimes.
    - int_19h 7 months ago |parent
      If you showed the raw output of, say, QwQ-32 to any engineer from 10 years ago, I suspect they would be astonished to hear that this doesn't count as "true reasoning".
    - simonw 7 months ago |parent
      Genuine question: what does "reasoning" mean to you?
    - criddell 7 months ago |parent
      How do you assess how true one's reasoning is?
zurfer 7 months ago
Model releases without comprehensive coverage of benchmarks make me deeply skeptical.
The worst was the gpt4o update in November. Basically a 2 liner on what it is better at and in reality it regressed in multiple benchmarks.
Here we just get MMLU, which is widely known to be saturated and knowing they trained on synthetic data, we have no idea how much "weight" was given to having MMLU like training data.
Benchmarks are not perfect, but they give me context to build upon. ---
edit: the benchmarks are covered in the paper: https://arxiv.org/pdf/2412.08905
PoignardAzur 7 months ago
Saying that a 14B model is "small" feels a little silly at this point. I guess it doesn't require a high-end graphics card?
ai_biden 7 months ago
I'm not too excited by Phi-4 benchmark results - It is#BenchmarkInflation.
Microsoft Research just dropped Phi-4 14B, an open-source model that’s turning heads. It claims to rival Llama 3.3 70B with a fraction of the parameters — 5x fewer, to be exact.
What’s the secret? Synthetic data. -> Higher quality, Less misinformation, More diversity
But the Phi models always have great benchmark scores, but they always disappoint me in real-world use cases.
Phi series is famous for to be trained on benchmarks.
I tried again with the hashtag#phi4 through Ollama - but its not satisfactory.
To me, at the moment - IFEval is the most important llm benchmark.
But look the smart business strategy of Microsoft:
have unlimited access to gpt-4 the input prompt it to generate 30B tokens train a 1B parameter model call it phi-1 show benchmarks beating models 10x the size never release the data never detail how to generate the data( this time they told in very high level) claim victory over small models
liminal 7 months ago
Is 14B parameters still considered small?
parmesean 7 months ago
13.8 epochs of the benchmarks?