What is a quantized model?

If you have ever seen AI models before installation, you have probably come across something like: “mistral-7b-instruct-v0.3.Q4_K_M.gguf”. The model name may ring a bell, we have already explained the meaning of the letter “b”, but what is that string of letters and numbers at the end?…

That is quantization. A simple but clever technique that allows a powerful AI model to run on an ordinary office computer, without any significant loss in quality.

The key idea in one sentence

AI models originally require a huge amount of memory. Quantization reduces that memory footprint by storing the model’s internal “weights” (its numerical values) using fewer bits; slightly less precise, but with acceptable quality loss.

Think of it like compressing a photo: you go from a 48-megapixel original to a 12-megapixel version. When printed, the difference is barely visible, but the file is four times smaller.

Why is this necessary?

A 7-billion-parameter model in its original form requires around 28 GB of memory. A standard office computer simply cannot run it. With Q4 quantization, that same model fits into ~4.5 GB and can run comfortably on a machine with only 8 GB of RAM.

That is why ArkeoAI uses Q4_K_M or Q5_K_M models by default: quality is sufficient for everyday professional tasks, and the hardware requirements remain realistic.

The formats and what they mean

The most common quantization levels, from best quality to most compressed:

Format	Size	Quality	RAM needed	Note
Q8_0	~7-8 GB	Excellent	12+ GB	Near-perfect quality
Q5_K_M	~5 GB	Very good	8 GB	Recommended ✓
Q4_K_M	~4.5 GB	Good	8 GB	Most widely used ✓
Q3_K_M	~3.5 GB	Fair	6 GB	Last resort only
Q2_K	~2.7 GB	Poor	4 GB	Not recommended

What does quantized data actually look like?

This is a legitimate question that rarely gets a straight answer. Here is a concrete example: imagine that one of the model’s internal values (a “weight”) originally holds this decimal number:

0.48291763

Quantization “simplifies” that value progressively and more aggressively depending on the compression level:

Format	Stored value	What it means
Original (FP32)	0.48291763	Full-precision decimal number
FP16 (16-bit)	0.4829	Slight rounding, barely noticeable
Q8 (8-bit)	123	Integer on a scale (e.g. 0-255)
Q4 (4-bit)	7	Integer on a scale (e.g. 0-15)
Q2 (2-bit)	2	Only 4 possible values (0-3)

Important: these values mean nothing in isolation. An AI model is made up of billions of such numbers that together form the model’s “knowledge”. A single number pulled out of context is meaningless, like a single letter taken from a book.

Can sensitive content be recovered from these values?

This is the question every privacy-conscious user asks and it is particularly important in the context of ArkeoAI.

The short answer: no.

The values stored in the quantized model (such as the 0.48… → 7 above) come from the model’s training process, not from your documents. Your files never enter the model; the model generalizes from texts seen during training, it does not copy them.

Your documents are stored in ArkeoAI in a separate database (the RAG system), which the model queries but never writes to. This database stays on the machine, offline, under your control.

In other words: the quantized model file (.gguf) contains nothing about your clients, your contracts, or your correspondence. That data stays on the computer, the model is simply a “tool” that gets queried.

Summary

Quantization is simply a compression technique: the AI’s internal values are stored with reduced precision so the model can run on more modest hardware. Quality loss is minimal for typical office tasks.

Q4_K_M and Q5_K_M: the best balance between quality and hardware requirements
A 7B model in Q4 weighs ~4.5 GB, compared to ~28 GB in its original form
Your documents do not enter the model and cannot be extracted from it
ArkeoAI runs offline: your data never leaves the machine