Gemma 4 Secret MTP Discovery Sparks Developer Backlash
Why It Matters
The removal of performance-enhancing features suggests a growing trend of 'nerfing' open-weights models to maintain a gap between free and paid enterprise offerings. This impacts developers' ability to optimize on-device inference for mobile applications.
Key Points
- A developer discovered hidden Multi-Token Prediction (MTP) weights in Gemma 4 files via the LiteRT API.
- A Google employee allegedly confirmed MTP was disabled to ensure 'compatibility and broad usability' across different hardware.
- The community is exploring reverse-engineering the LiteRT compute graph to restore the disabled performance features.
Google has come under scrutiny after developers discovered that the Gemma 4 model architecture contains latent Multi-Token Prediction (MTP) weights that were disabled in the official release. The discovery was made by a developer using the LiteRT API on a Google Pixel 9 device, where tensor shape errors revealed the presence of MTP heads designed for speculative decoding. A Google representative reportedly confirmed that the feature was intentionally removed to ensure broad compatibility across various hardware environments. This revelation follows previous community disappointment regarding the unreleased Gemma 124B model. Technical experts are now discussing the possibility of reverse-engineering the LiteRT compute graph to reactivate these high-speed generation capabilities. Google has not yet issued a formal statement regarding whether a 'Pro' version of the weights with MTP enabled will be released to the public.
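For readers unfamiliar with the technique, the idea behind MTP-based speculative decoding can be sketched in a few lines. The code below is purely illustrative: the function names and the trivial token rule are invented for this example and do not reflect Gemma's actual internals or the LiteRT API. The principle is that a cheap draft head proposes several future tokens at once, and the full model only verifies them, accepting matches without paying a full generation pass per token.

```python
# Toy sketch of speculative decoding with a multi-token prediction (MTP) head.
# All names are hypothetical; this is not Gemma or LiteRT code.

def base_model_next_token(context):
    """Stand-in for a full, expensive forward pass: returns the next token."""
    # Trivial deterministic rule so the example is runnable.
    return f"tok{len(context)}"

def mtp_head_draft(context, k=3):
    """Stand-in for an MTP head: cheaply drafts k future tokens in one pass."""
    return [f"tok{len(context) + i}" for i in range(k)]

def speculative_decode(context, steps=6, k=3):
    """Accept drafted tokens only where they match the base model's output."""
    out = list(context)
    while len(out) - len(context) < steps:
        draft = mtp_head_draft(out, k)
        for tok in draft:
            target = base_model_next_token(out)  # one verification per token
            if tok == target:
                out.append(tok)        # draft accepted
            else:
                out.append(target)     # mismatch: fall back to the base model
                break
            if len(out) - len(context) >= steps:
                break
    return out[len(context):]

print(speculative_decode(["<s>"], steps=4))
# In this toy setup the draft always matches, so every token is accepted:
# ['tok1', 'tok2', 'tok3', 'tok4']
```

In a real system the verification step batches all k drafted tokens into a single forward pass, which is where the speedup comes from; disabling the MTP head forces the model back to one token per pass.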
A developer digging into Google’s new Gemma 4 model found some hidden 'secret sauce' that could make it run much faster. It turns out Google included the machinery for 'Multi-Token Prediction'—which lets the AI guess multiple words at once—but turned it off before releasing the model to the public. After the developer hit errors trying to load the model on a phone, a Google employee confirmed the feature was disabled on purpose to make sure the model works across a wider range of devices. Now, the AI community is frustrated because they feel they're getting a slower version of what Google actually built.
Sides
Critics
Discovered the hidden weights and expressed frustration that Google 'nerfed' the model's speed.
Argue that Google is gatekeeping performance and want full transparency regarding model capabilities.
Defenders
Maintain that removing the feature was a technical decision to ensure the model runs reliably on a wider range of consumer devices.
Forecast
Independent researchers will likely attempt to patch the Gemma 4 weights to re-enable MTP within the next few weeks. Google may face pressure to release an 'Experimental' or 'Turbo' branch of Gemma 4 that officially supports these faster inference methods.
Based on current signals. Events may develop differently.
Timeline
Google confirmation reported
The developer claims a Google employee confirmed the intentional removal of MTP for compatibility reasons on a Hugging Face discussion thread.
Hidden MTP weights discovered
Reddit user Electrical-Monitor27 reports finding MTP prediction heads while debugging LiteRT on a Pixel 9.