Continuous Autoregressive Language Models (arxiv.org)
102 points by Anon84 8 days ago | 9 comments
  • killerstorm 11 hours ago

    Would be interesting to combine it with Reasoning in the Latent Space: feed the vector from the output layer of the transformer back to the input.

    Obviously, you can't do it in pre-training. But you can add it later as an optional 'extra' vector, I think. E.g. `input_embedding + MLP(prev_output) * alpha`. Alpha is zero during pre-training.
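
    A rough sketch of what that could look like in PyTorch (the module name, MLP shape, and wiring are made up for illustration, not taken from the paper):

        import torch
        import torch.nn as nn

        class LatentFeedback(nn.Module):
            # Mixes the previous output vector back into the next input embedding,
            # gated by a learnable alpha that starts at zero so pre-trained
            # behaviour is unchanged until the extra path is trained.
            def __init__(self, d_model: int, d_hidden: int = 512):
                super().__init__()
                self.mlp = nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.GELU(),
                    nn.Linear(d_hidden, d_model),
                )
                self.alpha = nn.Parameter(torch.zeros(1))  # zero during pre-training

            def forward(self, input_embedding, prev_output):
                return input_embedding + self.mlp(prev_output) * self.alpha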

    • vessenes 9 hours ago | parent

      I like this plan, but don't you already get this from the input vectors in the prompt, at least if inference is 'chunk-wise': generating a latent-space vector, decoding it, outputting it, then doing the next one?

      What if you trained a separate thinking phase using the autoencoder, though? Might be more efficient, and then you've got it using neuralese internally.

      Actually, reading the (summary) paper, they tried your idea and had trouble with it for a different reason:

         > Once the generative head predicts the next vector z_i, a natural next step would be to feed it directly as input to the Transformer for predicting z_{i+1}. However, we found that the model struggles to unpack the semantic information from such a compact representation. Instead, we ground the autoregressive process back in the more structured discrete space, where the predicted z_i is passed through the autoencoder to reconstruct the K tokens.
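
      In code, the loop the quote describes is roughly the following (module names are placeholders, not the paper's implementation): each step predicts one latent vector, decodes it into K tokens through the autoencoder, and feeds those tokens, not the raw vector, back in.

          import torch

          @torch.no_grad()
          def generate(transformer, gen_head, autoencoder, token_ids, n_chunks, K):
              for _ in range(n_chunks):
                  hidden = transformer(token_ids)        # context over discrete tokens
                  z = gen_head(hidden[:, -1])            # predicted next latent vector
                  new_tokens = autoencoder.decode(z)     # reconstruct K tokens from z
                  # ground the process back in token space rather than feeding z in
                  token_ids = torch.cat([token_ids, new_tokens], dim=1)
              return token_ids
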
  • mike_hearn 2 hours ago

    If they can reinvent RL so it works with this, then I guess the big labs will be all over it, as ~halving inference costs would be huge (especially if Ed Zitron's leaked OpenAI inference costs are accurate). Potentially the difference between inference being profitable and loss-making. It's an elegant approach.

    I also wonder how far they can push K if other aspects are tweaked. The approach of just doubling the parameter each time leaves a lot of space between the chosen value and the next value known not to work.

  • mentalgear 12 hours ago

    Very interesting. I also find these two training terms quite elegant:

    - Diversity: This term encourages the model to generate a diverse set of samples, preventing mode collapse.

    - Fidelity: This term rewards the model for making predictions that are close to the ground truth.
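
    A rough sketch of how a two-term objective like that can be written, following the general shape of an energy-score loss (this is my reading, not the paper's exact formulation):

        import torch

        def energy_score_loss(samples: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
            # samples: (S, D) vectors drawn from the generative head; target: (D,)
            # fidelity pulls samples toward the ground-truth vector
            fidelity = torch.cdist(samples, target.unsqueeze(0)).mean()
            # diversity pushes samples apart, discouraging mode collapse
            # (the zero diagonal only rescales the mean; fine for a sketch)
            diversity = torch.cdist(samples, samples).mean()
            return fidelity - 0.5 * diversity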

    I'm wondering if a continuous next-vector generative approach would also increase the innate "reasoning" capabilities of the model, since it could potentially capture more of the semantics of the data than just tokens do.

    • barrenko 12 hours ago | parent

      And maybe it would even be better suited to some sorts of RL fine-tuning?

      • mike_hearn 2 hours ago | parent

        They say this technique isn't compatible with RL yet because you can't adjust the logits. So no GRPO, I guess, which is going to be the biggest issue: an LLM with no RL applied isn't going to be that useful.
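
        For context, the reason logits matter: a GRPO-style update weights the group-relative advantage by a probability ratio computed from per-token log-probs of the sampled tokens, roughly as in the generic sketch below (not from the paper), and a model that emits continuous vectors instead of a softmax over the vocabulary has no direct equivalent.

            import torch
            import torch.nn.functional as F

            def grpo_ratio(logits_new, logits_old, tokens):
                # logits_*: (T, vocab); tokens: (T,) sampled token ids
                lp_new = F.log_softmax(logits_new, dim=-1)
                lp_old = F.log_softmax(logits_old, dim=-1)
                lp_new = lp_new.gather(1, tokens[:, None]).squeeze(1)
                lp_old = lp_old.gather(1, tokens[:, None]).squeeze(1)
                # this ratio is multiplied by the group-relative advantage
                return torch.exp(lp_new - lp_old)
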

  • vatsachak 5 hours ago

    K being fixed here seems like something that will eventually be done away with.

    When I'm thinking about math proofs, sometimes I have a single idea that unfolds into a hundred lines of proof.

    Maybe I'm getting the wrong analogy here, but if vectors = ideas, then K should depend on the vector.
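
    Purely hypothetically, a variable-K decode could look something like this (none of these modules exist in the paper; `length_head` is invented for illustration):

        import torch

        @torch.no_grad()
        def generate_variable_k(transformer, gen_head, length_head,
                                autoencoder, token_ids, n_chunks):
            for _ in range(n_chunks):
                hidden = transformer(token_ids)[:, -1]
                z = gen_head(hidden)  # next latent vector
                # invented: a head that picks how many tokens z unfolds into
                # (assumes batch size 1 for simplicity)
                k = int(length_head(hidden).argmax(-1).item()) + 1
                new_tokens = autoencoder.decode(z, num_tokens=k)
                token_ids = torch.cat([token_ids, new_tokens], dim=1)
            return token_ids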

  • notrealyme123 6 hours ago

    Congratulations to the authors, but dammit, there goes a good idea ^^

  • suddenlybananas 12 hours ago

    The technique of compressing tokens down reminds me a bit of byte latent transformers.