Google's powerful language models are now available on regular graphics cards

Google has released an update to its Gemma 3 language models that adapts them to run efficiently on consumer graphics cards, thanks to Quantization-Aware Training. Even the largest models in the Gemma series can now be used on ordinary personal computers and laptops, making advanced AI accessible to a far wider range of users.

Google has introduced an updated line of Gemma 3 models built with Quantization-Aware Training (QAT), which lets these large language models run on consumer graphics cards such as the NVIDIA RTX 3090 without significant loss of text-generation quality. Previously, the Gemma models delivered their full performance only on high-end server accelerators of the H100 class, using the BFloat16 (BF16) compute format. The new versions cut video-memory consumption dramatically, making the models practical for local inference.

QAT integrates the quantization step directly into the training process, so the network learns to compensate for the reduced bit depth of its weights and activations. The result is a sharp reduction in the memory needed to store the model, saving hardware resources and widening the circle of potential users.
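As a rough illustration of the idea (not Google's actual training pipeline), the sketch below fake-quantizes a linear layer's weights to int4 in the forward pass while letting gradients flow through unchanged, the so-called straight-through estimator. The layer sizes and the simple per-tensor quantization scheme are assumptions made for the demo.

```python
# Minimal QAT sketch in PyTorch: a hypothetical illustration of the
# general technique, not Google's Gemma 3 training code.
import torch
import torch.nn as nn


class FakeQuantInt4(torch.autograd.Function):
    """Simulates symmetric int4 quantization in the forward pass while
    passing gradients through unchanged (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / 7.0               # map weights onto the int4 range [-8, 7]
        q = torch.clamp(torch.round(w / scale), -8, 7)
        return q * scale                          # dequantized weights used downstream

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                        # straight-through: treat rounding as identity


class QATLinear(nn.Linear):
    """A linear layer that trains against its own quantization noise."""

    def forward(self, x):
        return nn.functional.linear(x, FakeQuantInt4.apply(self.weight), self.bias)


# Tiny training loop on random data: because the model experiences
# quantization noise during training, the weights it converges to stay
# accurate once they are actually stored as int4 after training.
model = nn.Sequential(QATLinear(64, 64), nn.ReLU(), QATLinear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(100):
    x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```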

For example, the 27-billion-parameter model (Gemma 3 27B) previously occupied about 54 GB of video memory in BF16 mode, whereas after quantization to int4 its footprint shrinks to just 14.1 GB. The other members of the family slim down similarly: Gemma 3 12B (from 24 GB to 6.6 GB), Gemma 3 4B (from 8 GB to 2.6 GB), and Gemma 3 1B (from 2 GB to 0.5 GB).
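These figures follow almost directly from the storage cost per parameter: BF16 uses 2 bytes, int4 uses half a byte. A quick back-of-the-envelope check (the parameter counts are nominal, and the published int4 sizes include overhead for quantization scales and tensors left unquantized):

```python
# Rough reconstruction of the reported VRAM figures.
GB = 1e9  # the article's numbers line up with decimal gigabytes

for name, params in [("27B", 27e9), ("12B", 12e9), ("4B", 4e9), ("1B", 1e9)]:
    bf16_gb = params * 2.0 / GB   # BF16: 2 bytes per parameter
    int4_gb = params * 0.5 / GB   # int4: 4 bits = 0.5 bytes per parameter
    print(f"Gemma 3 {name}: BF16 ≈ {bf16_gb:.1f} GB, int4 ≈ {int4_gb:.1f} GB")

# Gemma 3 27B: BF16 ≈ 54.0 GB, int4 ≈ 13.5 GB   (published: 14.1 GB)
# Gemma 3 12B: BF16 ≈ 24.0 GB, int4 ≈ 6.0 GB    (published: 6.6 GB)
# Gemma 3 4B:  BF16 ≈ 8.0 GB,  int4 ≈ 2.0 GB    (published: 2.6 GB)
# Gemma 3 1B:  BF16 ≈ 2.0 GB,  int4 ≈ 0.5 GB    (published: 0.5 GB)
```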

These low-bit representations reduce the load on the hardware and make it possible to run powerful models on home computers. Gemma 3 27B (int4) fits comfortably on the popular RTX 3090 with its 24 GB of memory, Gemma 3 12B is within reach of laptops with an RTX 4060 (8 GB of video memory), and the smaller Gemma 3 4B and 1B are suitable even for mobile platforms.
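Note that the weights are not the whole memory budget: the KV cache and activations also need room at inference time. A minimal headroom check, assuming an NVIDIA GPU and a PyTorch build with CUDA:

```python
# Compare the GPU's total memory against a model's int4 weight footprint.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    weights_gb = 14.1  # Gemma 3 27B int4, from the figures above
    print(f"GPU memory: {total_gb:.1f} GB; after loading weights, "
          f"{total_gb - weights_gb:.1f} GB remain for the KV cache and activations")
```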

The developers have ensured compatibility with a range of tools and frameworks, including Ollama (for a quick start from the command line), LM Studio (a graphical app for downloading and testing models), MLX (optimized inference on Apple Silicon), as well as gemma.cpp and llama.cpp, which can run the models on CPUs using the popular GGUF format.
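For instance, here is a minimal local-inference sketch using the official Ollama Python client (installed with `pip install ollama`), assuming the quantized model has already been pulled. The model tag `gemma3:12b-it-qat` is an assumption; check the Ollama model library for the tags actually published.

```python
# Query a locally served Gemma 3 QAT model through the Ollama Python client.
import ollama

response = ollama.chat(
    model="gemma3:12b-it-qat",  # assumed tag for the int4 QAT build
    messages=[{"role": "user",
               "content": "Explain quantization-aware training in two sentences."}],
)
print(response["message"]["content"])
```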
