Show HN: Speeding up LLM inference 2x times (possibly)

kolinko | 419 points

Essentially the algorithm is about pruning parameters on the fly and making the weight matrix sparse by setting least significant weights to zero, where "least significant" is defined in terms of the rank of the absolute value of the pruned weight within its group?
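For concreteness, a toy sketch of that kind of rank-of-|w| pruning (illustrative Python only, not the OP's actual code, which as I understand it also weighs each product by the input value at runtime):

    import numpy as np

    def prune_by_rank(W, keep_fraction=0.25, group_size=16):
        # Zero all but the top-ranked |w| weights within each group of a row.
        W = W.copy()
        k = max(1, int(group_size * keep_fraction))
        for i in range(W.shape[0]):
            for start in range(0, W.shape[1], group_size):
                group = W[i, start:start + group_size]   # view into W
                order = np.argsort(np.abs(group))        # ascending rank of |w|
                group[order[:-k]] = 0.0                  # drop everything below the top k
        return W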

A search for model pruning turns up many results, including https://arxiv.org/abs/2305.11627 which discusses "magnitude pruning" as a baseline, and refers to https://arxiv.org/pdf/2301.00774.pdf, which asserts in the introduction:

> First, as shown in Figure 1, SparseGPT can induce uniform layerwise sparsity of up to 60% in e.g. the 175-billion-parameter variant of the OPT family (Zhang et al., 2022), with minor accuracy loss. By contrast, the only known one-shot baseline which easily extends to this scale, Magnitude Pruning (Hagiwara, 1994; Han et al., 2015), preserves accuracy only until 10% sparsity, and completely collapses beyond 30% sparsity.

I don't like how these papers boost their own methods by implementing the baseline poorly and describing their approach in heavy mathematical jargon - the OP's blog post is a breeze by comparison, making the method accessible to people with very little background.

renonce | 16 days ago

I love this line in the gpu implementation section.

"Readers fresh to GPU programming may ask now - how does it work?

Readers experienced with GPU programming may ask - how the hell does it work?"

spencerchubb | 16 days ago

> So instead, let's flip the matrix, sort the elements row-wise, and revisit the multiplications from that direction.

> This is called a Compressed Sparse Row (CSR) format by the smart people. To do the multiplication now, we take, say, the 1 from the vector, multiply it by 256, and add it into the output vector at the 3rd row. And so on.

> Now, let's see what happens if we truncate the last column - the one with the lowest values.

How does CSR work with a reduced number of multiplications?
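My mental model in code, in case it helps (a toy sketch that pre-sorts each row by |w| and keeps the original column indices; it ignores the flip-the-matrix detail and ranks by weight magnitude only, whereas the actual effort scheme also looks at the input values - so illustrative only, not the OP's Metal kernels):

    import numpy as np

    def sort_rows(W):
        # Pre-sort each row by descending |w|, remembering each value's original column.
        idx = np.argsort(-np.abs(W), axis=1)
        val = np.take_along_axis(W, idx, axis=1)
        return idx, val

    def truncated_matvec(v, idx, val, effort=0.25):
        # Only multiply against the first `effort` fraction of each sorted row;
        # the trailing (smallest) weights are never touched at all.
        k = max(1, int(val.shape[1] * effort))
        out = np.zeros(val.shape[0])
        for i in range(val.shape[0]):
            out[i] = val[i, :k] @ v[idx[i, :k]]
        return out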

bigcat12345678 | 16 days ago

This seems similar to semi-structured (aka 2:4) sparsity, so it may be worth comparing explicitly. As far as I can tell from skimming, your technique:

- is optimized for Apple silicon
- ~2x speed at 75% sparsity
- dynamic: depends on input, applied at runtime
- can choose amount of sparsity

And 2:4 semi-structured sparsity:

- is optimized for GPUs with sparse tensor cores (NVIDIA Ampere and beyond)
- ~2x speed at 50% sparsity
- static, applied to the model at rest
- probably worse results than your technique at 50% sparsity

The interesting comparison I'd want to see is semi-structured sparsity results (50% sparsity, 2x speedup) vs your results at 75% sparsity (2x speedup).
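(For anyone unfamiliar: 2:4 just means that in every contiguous group of 4 weights at most 2 are nonzero - typically the 2 largest by magnitude - which is the pattern Ampere's sparse tensor cores accelerate. A rough sketch of the pruning pattern, not the cuSPARSELt/PyTorch API, assuming row lengths divisible by 4:)

    import numpy as np

    def two_four_prune(W):
        # Zero the 2 smallest-|w| weights in every contiguous group of 4.
        W = W.copy()
        groups = W.reshape(-1, 4)                        # view over the contiguous copy
        drop = np.argsort(np.abs(groups), axis=1)[:, :2] # indices of the 2 smallest per group
        np.put_along_axis(groups, drop, 0.0, axis=1)
        return W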

hatthew | 16 days ago

Having used CSR, it's not surprising, and some newer formats like blocked ELL might have more mechanical sympathy, since they avoid uncoalesced reads/gathers, though the code is trickier.

marmaduke | 16 days ago

Nice idea and write-up! I'm also working in the field of sparsity for neural network inference. A few things come to mind that you should be aware of:

- Compared with a dense Matrix-Vector Multiply implementation, your algorithm adds algorithmic complexity, but reduces memory traffic. Because Matrix-Vector is usually memory-bound, your throughput increases when you can reduce memory accesses. For batch sizes > 1, the speedup would likely disappear very quickly because you are no longer bottle-necked by memory accesses. (Assuming that's a scenario you are interested in)

- As a comparison, I would also like to see it against a model whose architecture is natively ~2x faster (so if you apply your method to a 13B-parameter LLM at 50% sparsity, how does its performance fare against a 7B-parameter LLM? How does it compare to the same LLM quantized to half the baseline bitwidth?). If you could show that your sparse model produces higher-fidelity output in the same amount of time (compared to an established inference framework), I'm sure that would make for an interesting publication.

- Because you are omitting multiplications, your approximation will likely be biased toward a smaller absolute value than the true result. If you can add a correction term to compensate for that systematic error, it will likely give you slightly better accuracy (rough sketch below).
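A rough, untested illustration of what such a correction could look like - rescaling each output element by the |v*w| mass that was dropped. It computes every product anyway, so it only shows the idea, not the fast path; a real kernel would estimate the dropped mass from precomputed statistics:

    import numpy as np

    def approx_matvec_corrected(v, W, effort=0.25):
        # Keep only the largest |v_j * W_ij| products per row, then scale the
        # partial sum back up by the magnitude mass that was dropped.
        contrib = W * v                                  # all candidate products
        k = max(1, int(W.shape[1] * effort))
        out = np.zeros(W.shape[0])
        for i in range(W.shape[0]):
            keep = np.argsort(np.abs(contrib[i]))[-k:]
            kept_mass = np.abs(contrib[i, keep]).sum()
            total_mass = np.abs(contrib[i]).sum()
            scale = total_mass / kept_mass if kept_mass > 0 else 1.0
            out[i] = contrib[i, keep].sum() * scale
        return out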

jadeaffenjaeger | 15 days ago

That looks like fantastic stuff. I just want to point out that the 15ms delay looks similar to 60Hz vsync (16.7ms); if you are updating the screen once per token, maybe that's causing a sync somehow?

globally_unique | 16 days ago

Thank you for this really cool and open contribution!

I will be watching llama.cpp closely for them to implement this!

I’ve been looking for ways to speed up CPU inference, and I really like this idea of “effort”.

dartos | 16 days ago

Nice writeup! Very curious about the performance per VRAM with this compared to just quantization. Any plans to implement a cross platform version?

gsuuon | 16 days ago

I know this doesn't retrain, but I wonder if approaches like this plus quantization could get back any "lost" quality with some post-training.

It's great to see, and to know how much better performance and cost are likely to get in the future.

I know it's fun to work on, but I also want to say THANK YOU for developing open source.

yousif_123123 | 16 days ago

Could you break down a bit more why you can skip so many calculations?

kanwisher | 16 days ago

I'm wondering if you could sort the two inputs, add the indices for the multiplications together, then take the largest of that.

In your 0.1 example, 1000 gets index 2 and 0.1 gets index 0, which combine to 2. This will tie with the 1*8, but I think it would even out with larger vector lengths.

Edit: I could be wrong but I think you can precompute the indices for the weights in advance without a prompt, then you won't need to perform those sorts at runtime.
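Something like this, if I follow (a purely hypothetical sketch of the rank-sum idea for a single row of weights):

    import numpy as np

    def rank_sum_topk_dot(v, w, k):
        # Score each position by rank(|v|) + rank(|w|) and only multiply the
        # k highest-scoring pairs. w_rank depends only on the weights, so it
        # could be precomputed offline, as noted in the edit above.
        v_rank = np.argsort(np.argsort(np.abs(v)))       # 0 = smallest ... n-1 = largest
        w_rank = np.argsort(np.argsort(np.abs(w)))
        keep = np.argsort(v_rank + w_rank)[-k:]
        return float(v[keep] @ w[keep])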

bick_nyers | 16 days ago

I think this is the overall idea behind MoE LLMs, right? MoE just expands on it by learning which sets of weights to use.

byyoung3 | 16 days ago

Could you explain the pseudocode on your equations page? Is the second approxMul call a typo (also the capitalized V)?

    def precompute(W):
        W = W.T
        probes = get_probes(W)
        W_idx, W_val = sortMatrixRows(W)

    def approxMul(v, W_idx, W_val, probes):
        cutoff_chart = v * probes
        cutoff = topK(cutoff_chart, effort)
        approxMul(V, W_idx, W_val, cutoff)

gcy | 16 days ago

I've been trying to think about how you'd amp up the batch size here. It's a bit tricky since the memory access would be way higher, but I think you can actually still save on compute by chunking things up in a clever way to utilize the tensor cores.

brrrrrm | 16 days ago

Looks like the models are missing on Huggingface, I've logged an issue: https://github.com/kolinko/effort/issues/3

smcleod | 16 days ago

We need this in llama.cpp; it seems somewhat stable down to 40% effort.

avereveard | 16 days ago

I was wondering if something similar can be done for CNNs too.

wayoverthecloud | 16 days ago
saurik | 16 days ago

Okay, now add a small model that decides how much effort is needed at each inference step, and we're good to go.

uhlo | 16 days ago

How does it compare to 8-bit/4-bit quantization in terms of speed/accuracy?

coolvision | 16 days ago

If it sounds too good to be true, it probably is not.

If removing weights improves some metrics, that may be a clue that the model is not optimal in some sense.

a2code | 16 days ago

Name idea: lobotomize

thelastparadise | 15 days ago

Do you have a simple Python impl? :)

brrrrrm | 16 days ago

My takeaway is that this proves what was said on the recent latent.space podcast with David Luan from Adept.

https://www.latent.space/p/adept

"I think is going to commoditize a lot of the regular LLMs and soon regular multimodal models."

In other words, if you train your own models, you will not get to take advantage of breakthroughs like this that start with open models (like Mistral).

All the advantages are going towards the open models and this is an existential risk for OpenAI and other closed model companies.

xrd | 16 days ago

Here to answer questions!

A friend of mine published the original link on HN before me, so I hope a double post won't be an issue :)

kolinko | 16 days ago

Please, please, please call the final product Halfwit.

Seriously though, this is a very interesting biologically inspired idea, since not all neuronal pathways fire all the time.

It seems to follow that, if you can predict which weights you won't need, then you should be able to compress the model architecture permanently, at least for certain use cases.

queuebert | 16 days ago
