Friendica Social Network

TurboQuant compresses LLM key-value caches down to 3 bits per value: 6× memory reduction, up to 8× faster attention, and zero accuracy degradation.

TurboQuant looks like a pretty big deal for running local models efficiently. The core issue it tackles is the memory bottleneck caused by the key-value (KV) cache during generation. In long-context inference, storing all those high-dimensional vectors eats up VRAM extremely fast. Traditional vector quantization helps, but it usually introduces memory overhead of its own, because scaling factors or other constants have to be stored in full precision for every small block of data. That overhead can easily add an extra bit or two per value, which ruins the compression targets people are aiming for.
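To make the "extra bit or two" concrete, here is a quick back-of-the-envelope sketch. The block size of 32 and the two constants per block (a scale and a zero point) are my own assumptions for illustration, not numbers from the TurboQuant write-up.

```python
def effective_bits_per_value(code_bits, block_size, constant_bits, constants_per_block):
    """Average storage cost per value once per-block constants are counted.

    All parameters are hypothetical knobs for illustration:
    code_bits           -- bits used for each quantized value
    block_size          -- number of values sharing one set of constants
    constant_bits       -- precision of each stored constant (16 or 32)
    constants_per_block -- e.g. 2 for a scale plus a zero point
    """
    overhead = constant_bits * constants_per_block / block_size
    return code_bits + overhead


if __name__ == "__main__":
    # 3-bit codes, blocks of 32 values, fp16 scale + fp16 zero point
    print(effective_bits_per_value(3, 32, 16, 2))
    # Same layout with fp32 constants
    print(effective_bits_per_value(3, 32, 32, 2))
```

Under those assumptions a nominal 3-bit scheme really costs 4 or 5 bits per value, which is the kind of overhead the post is talking about.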

TurboQuant addresses this by combining two mathematical tricks that eliminate the overhead and bring the cache down to 3 bits without losing accuracy. The first is an algorithm called PolarQuant. Instead of representing the vectors in standard Cartesian coordinates, it converts them to polar coordinates, which separates magnitude from direction. Because the angles map onto a fixed, predictable circular grid, the model no longer needs to store th
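The polar idea can be illustrated with a toy two-dimensional sketch. This is only my simplification, not the actual PolarQuant algorithm: it takes a single (x, y) pair, keeps the radius in full precision, and snaps the angle to a uniform grid with 2^bits levels. The function names are made up for the example.

```python
import math


def to_polar(x, y):
    """Split a 2-D point into magnitude (radius) and direction (angle)."""
    return math.hypot(x, y), math.atan2(y, x)


def quantize_angle(theta, bits):
    """Snap an angle to the nearest point of a uniform grid with 2**bits levels."""
    levels = 1 << bits
    step = 2 * math.pi / levels
    return round(theta / step) % levels


def dequantize_angle(index, bits):
    """Recover the grid angle from its index; only `bits` is needed to decode."""
    return index * (2 * math.pi / (1 << bits))


def reconstruct(radius, index, bits):
    """Rebuild an approximate (x, y) from the radius and the angle index."""
    theta = dequantize_angle(index, bits)
    return radius * math.cos(theta), radius * math.sin(theta)


def angular_error(theta, index, bits):
    """Circular distance between the true angle and its quantized version."""
    diff = theta - dequantize_angle(index, bits)
    return abs((diff + math.pi) % (2 * math.pi) - math.pi)


if __name__ == "__main__":
    radius, theta = to_polar(-2.0, 0.3)
    index = quantize_angle(theta, 3)
    print(index, reconstruct(radius, index, 3), angular_error(theta, index, 3))
```

In this toy version the angle grid is identical for every vector, so decoding an angle index needs nothing beyond the bit width, and the angular error is never more than half a grid step.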

The second piece of the puzzle is Quantized Johnson-Lindenstrauss (QJL), which cleans up the residual error left over from the first step. QJL uses a mathematical transform to shrink that leftover error down to a single sign bit (+1 or -1) while preserving the relative distances between the data points. This acts as a mathematical error checker that removes bias from the attention scores. Because it uses only one bit and preserves the geometry of the space, the attention mechanism can still compute accurate logits without needing full-precision data.
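Here is a rough sketch of the sign-bit trick, as I understand it. It is a simplification: it applies a Gaussian random projection to a whole key vector rather than to a residual from a first quantization stage, and it stores the key's norm alongside the sign bits. The names and sizes are my own. The estimator relies on the standard identity E[<g, q> · sign(<g, k>)] = sqrt(2/π) · <q, k> / ||k|| for a standard Gaussian vector g, which is what makes the estimate unbiased.

```python
import math
import random


def make_projection(num_rows, dim, rng):
    """Random Gaussian projection matrix, shared by keys and queries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def qjl_encode(key, projection):
    """Keep one sign bit per projected coordinate, plus the key's norm."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in projection]
    return signs, math.sqrt(dot(key, key))


def qjl_inner_product(query, signs, key_norm, projection):
    """Estimate <query, key> using the full-precision query and the key's sign bits."""
    total = sum(s * dot(row, query) for s, row in zip(signs, projection))
    return math.sqrt(math.pi / 2) * key_norm * total / len(projection)


if __name__ == "__main__":
    rng = random.Random(0)
    key = [1.0, 2.0, -1.0, 0.5, 0.0, 0.0, 3.0, -2.0]
    query = [0.5, 1.5, -1.0, 0.0, 1.0, -0.5, 2.0, -1.0]
    projection = make_projection(8000, len(key), rng)
    signs, norm = qjl_encode(key, projection)
    print("exact:   ", dot(query, key))
    print("estimate:", qjl_inner_product(query, signs, norm, projection))
```

The query stays in full precision and only the key is reduced to sign bits. The estimate gets tighter as the number of projection rows grows; 8000 rows is far more than a real system would use and is only there to make the convergence visible.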

They tested this on open-weights models like Gemma and Mistral across demanding needle-in-a-haystack and LongBench tasks. They managed to compress the KV cache down to 3 bits with zero drop in accuracy, and they did not even need any fine-tuning or calibration. On top of saving a massive amount of VRAM, the 4-bit version actually speeds up attention logit computation by up to 8× on H100 GPUs compared to standard 32-bit floats. This seems like a big leap forward for anyone trying to run long-context models on constrained hardware or scale up huge vector search databases.

TurboQuant: Redefining AI efficiency with extreme compression

research.google

#technology


☆ Yσɠƚԋσʂ ☆ via Technology

☆ Yσɠƚԋσʂ ☆
6 days ago
