Friendica Social Network

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

The hardware efficiency gains are honestly the most interesting part of the paper. The main reason DeepSeek-V4 is so cheap to run comes down to how they completely bypassed the quadratic cost of standard attention for massive context windows.
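For anyone who wants to see why the quadratic cost matters at this scale, here is a rough back-of-the-envelope sketch. It only counts query-key scores for a single head, and the compression ratio of 64 tokens per entry is a made-up number for illustration, not a figure from the paper.

```python
def attention_score_count(num_tokens: int) -> int:
    """Query-key scores in full self-attention for one head: every token scores every token."""
    return num_tokens * num_tokens


def compressed_score_count(num_tokens: int, tokens_per_entry: int) -> int:
    """Scores when each query attends to compressed entries instead of raw tokens."""
    num_entries = -(-num_tokens // tokens_per_entry)  # ceiling division
    return num_tokens * num_entries


for n in (1_000, 100_000, 1_000_000):
    full = attention_score_count(n)
    # 64 tokens per entry is a hypothetical ratio chosen for this example.
    compressed = compressed_score_count(n, tokens_per_entry=64)
    print(f"{n:>9} tokens: full={full:.2e}  compressed={compressed:.2e}")
```

Doubling the context quadruples the full-attention count, which is why a million tokens is so painful without some form of compression or sparsity.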

They built a hybrid attention architecture that interleaves Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Standard models keep every single token in the KV cache, which absolutely kills memory. CSA fixes this by compressing the KV cache of multiple tokens into a single entry and then using a sparse routing mechanism to compute attention over only the top-k most relevant compressed blocks. HCA takes it a step further by compressing an even larger number of tokens into one entry, but it computes dense attention over them. As a result, the 1.6T-parameter Pro model uses only a third of the compute FLOPs and 10% of the KV cache memory compared to DeepSeek-V3.2 at a one-million-token context.
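Here is a toy sketch of the two ideas side by side, just to make the difference concrete. It is not the paper's implementation: mean pooling stands in for whatever learned compression they actually use, the block sizes (4 and 32) and `top_k=8` are invented, and the function names are hypothetical. It also scores every compressed entry to pick the top-k, which a real system would presumably do with something cheaper.

```python
import math
import random


def mean_pool_blocks(vectors, block_size):
    """Compress each run of `block_size` vectors into one entry by averaging (assumed stand-in)."""
    entries = []
    for start in range(0, len(vectors), block_size):
        block = vectors[start:start + block_size]
        dim = len(block[0])
        entries.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return entries


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, top_k=None):
    """Softmax attention over compressed entries; with top_k set, only the best-scoring entries are used."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    chosen = list(range(len(keys)))
    if top_k is not None and top_k < len(keys):
        chosen = sorted(chosen, key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in chosen])
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, chosen)) for d in range(dim)]
    return output, sorted(chosen)


random.seed(0)
dim, num_tokens = 8, 256
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
query = [random.gauss(0, 1) for _ in range(dim)]

# "CSA-like": mild compression (4 tokens per entry), then sparse top-k selection.
csa_keys, csa_values = mean_pool_blocks(keys, 4), mean_pool_blocks(values, 4)
csa_out, csa_chosen = attend(query, csa_keys, csa_values, top_k=8)

# "HCA-like": heavy compression (32 tokens per entry), then dense attention over all entries.
hca_keys, hca_values = mean_pool_blocks(keys, 32), mean_pool_blocks(values, 32)
hca_out, hca_chosen = attend(query, hca_keys, hca_values)

print(len(csa_keys), len(csa_chosen))  # 64 entries cached, 8 attended
print(len(hca_keys), len(hca_chosen))  # 8 entries cached, 8 attended
```

In this toy setup both paths end up attending over 8 entries per query instead of 256 tokens. The sparse path keeps finer-grained entries and picks a few of them, while the dense path keeps coarser entries and uses all of them.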

They also aggressively pushed low-precision formats, applying FP4 quantization-aware training to the Mixture-of-Experts weights and the attention Qu
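As a rough illustration of what rounding weights to FP4 looks like, here is a minimal sketch of "fake quantization" to the E2M1 FP4 value grid. The per-block scaling scheme and the function name are my assumptions, not details from the paper. In quantization-aware training generally, the forward pass would use rounded weights like these while the optimizer keeps updating a higher-precision copy; this sketch only shows the rounding step.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_fp4(weights):
    """Scale a block so its largest weight maps to 6.0, round to the FP4 grid, then scale back."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]  # assumed per-block scale
    quantized = []
    for w in weights:
        # Pick the grid magnitude closest to the scaled weight.
        magnitude = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(abs(w) / scale - g))
        quantized.append(math.copysign(magnitude * scale, w))
    return quantized


block = [0.02, -0.31, 0.12, 0.6, -0.05]  # hypothetical weights
print(fake_quantize_fp4(block))
```

With only eight magnitudes per sign, small weights in a block with one large outlier get rounded to zero, which is part of why training with the quantization in the loop tends to matter.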

DeepSeek_V4.pdf · deepseek-ai/DeepSeek-V4-Pro at main

huggingface.co

#technology

monkeyslikebananas2

in reply to ☆ Yσɠƚԋσʂ ☆ • 3 weeks ago

865 GB? I can’t run that locally. I want something like 30 specialized 100 GB models that I can run locally. I can load and unload them as needed. Inference would take longer, but things have gotten good enough to set it and forget it.

☆ Yσɠƚԋσʂ ☆

in reply to monkeyslikebananas2 • 3 weeks ago

It looks like you can run a low-quant version on a 125 GB machine, and apparently performance is still really good. github.com/makepad/llama_antir…

GitHub - makepad/llama_antirez_deepseek

monkeyslikebananas2

in reply to ☆ Yσɠƚԋσʂ ☆ • 3 weeks ago

Interesting 🤔

sudoer777

in reply to ☆ Yσɠƚԋσʂ ☆ • 3 weeks ago

On OpenCode Go, DeepSeek V4 Flash is crazy cheap, and a lot of people are saying they're getting good results from it. V4 Pro is said to be competitive with Kimi K2.6 and GLM 5.1, and it's also a lot cheaper, at least for now.

This entry was edited (3 weeks ago)

partofthevoice

in reply to ☆ Yσɠƚԋσʂ ☆ • 3 weeks ago

Holy shit, I only learned what the quadratic cost of attention was like 2 weeks ago. Can we hit the brakes a bit before we start optimizing the shit out of everything? I am going to get lost in the layers of abstraction.
