LIMITATIONS OF LLMs IN PRECISION IMAGING TASKS
Ask an LLM (Large Language Model), or more loosely an AI (Artificial Intelligence) model, to create an image that depicts mathematical calculations. For example: "create an assessment which involves mathematical calculations based on an image containing a graph or geometric diagram that shows the problem to be solved". In other words, ask the model for a text assessment plus a symbolic image that supports it, and then check the answer. You may get an unpleasant surprise: the calculations shown in the image may not match what you asked for, or what the accompanying text says. The probability of this happening is not negligible. If you make the same request WITHOUT the image, the answer will probably be much more accurate. Several tests along these lines produced evidence of the errors described in this paragraph.
This can be treated as a limitation of the "mass" LLMs (general-purpose commercial models accessible to the general public) when they generate this type of image: one that can be described by a symbolic specification, as in the example above, and that must show the calculations, or at least a numerical indication of them, in agreement with the accompanying text. For this type of image, the problem has more to do with integrating existing solutions at large scale than with missing technology; the technical solution is addressed in item 1 of the conclusions later in this text. For images that must recreate real-world scenes (shadows, inclined planes, etc.) with geometric/mathematical precision, as opposed to more symbolic content such as graphs, there is still an unresolved technical problem, not only one of scale.
What happens: There is a fundamental difference between generating an image via diffusion (or via a "pixel-by-pixel"/"token-by-token" autoregressive process over a visual space) and generating a symbolic specification that is then executed by a deterministic interpreter (SVG, LaTeX/TikZ, Python/Matplotlib scripts, Canvas API, etc.) [1]. In other words, it is the difference between a probabilistic creation, in which representation failures may occur, and a deterministic creation, in which a correct symbolic specification is rendered by the interpreter exactly as defined.
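The deterministic side of this difference can be illustrated with a minimal Python sketch that uses only the standard library. The function name `right_triangle_svg`, the 3-4-5 triangle, and the pixel `scale` and margin are assumptions chosen for illustration; they do not come from any particular model or product. The point is that the labels are computed from the same numbers that place the vertices, so the image cannot disagree with the arithmetic.

```python
import math


def right_triangle_svg(a: float, b: float, scale: float = 40.0) -> str:
    """Return an SVG drawing of a right triangle with legs a and b.

    Every label is derived from the same values that position the
    vertices, so text and geometry stay consistent by construction.
    """
    c = math.hypot(a, b)  # hypotenuse via the Pythagorean theorem
    margin = 30.0
    # Vertices in pixel coordinates (SVG's y axis points down).
    x0, y0 = margin, margin + b * scale  # right-angle corner
    x1, y1 = margin + a * scale, y0      # end of the horizontal leg
    x2, y2 = x0, margin                  # end of the vertical leg
    width = a * scale + 2 * margin
    height = b * scale + 2 * margin
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width:g}" height="{height:g}">'
        f'<polygon points="{x0:g},{y0:g} {x1:g},{y1:g} {x2:g},{y2:g}" '
        f'fill="none" stroke="black"/>'
        f'<text x="{(x0 + x1) / 2:g}" y="{y0 + 20:g}">a = {a:g}</text>'
        f'<text x="{x0 - 25:g}" y="{(y0 + y2) / 2:g}">b = {b:g}</text>'
        f'<text x="{(x1 + x2) / 2 + 5:g}" y="{(y1 + y2) / 2:g}">c = {c:g}</text>'
        "</svg>"
    )


# The hypotenuse label reads "c = 5" because math.hypot(3, 4) == 5.0.
svg = right_triangle_svg(3, 4)
print(svg)
```

Calling the function twice with the same arguments returns the same string. A diffusion model gives no such guarantee for the numbers it paints into an image.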
Conclusions:
1. The technical solution (pipeline: LLM → Code → Secure Interpreter → Deterministic Image) already exists and works in controlled environments for symbolic images, though not for realistic ones. However, this pipeline is not integrated into the standard commercial image-generation workflows made available to billions of users, because of computational cost, latency, and the complexity of securing code execution at large scale. Given this scenario, it is understandable that this type of error occurs in pure diffusion models as a result of this current limitation, whether technical or of scale, but it is also important to emphasize that this is far from the ideal situation.
2. To take a position on this type of problem and debate it with well-founded propositions, we need more than basic knowledge of how LLMs work and at least basic Python (the latter to understand in practice, through code, what is said here in textual and more general terms about these concepts). Without this, it is difficult to take part in a discussion that directly affects all users of these models.
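The pipeline named in item 1 can be sketched in Python under some simplifying assumptions. Here `stub_llm_response` is a hypothetical stand-in for a real model call, and the two-field JSON spec (`title`, `points`) is an invented format, not an existing standard. As one possible reading of the secure interpreter step, the sketch validates a declarative spec instead of executing model-written code; sandboxing generated code would be another approach.

```python
import html
import json

ALLOWED_KEYS = {"title", "points"}  # the only fields the renderer accepts


def stub_llm_response() -> str:
    """Hypothetical stand-in for a model call; returns a JSON spec as text."""
    return json.dumps(
        {"title": "y = 2x + 1", "points": [[0, 1], [1, 3], [2, 5], [3, 7]]}
    )


def is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_spec(text: str) -> dict:
    """Validate the spec rather than executing anything the model wrote."""
    spec = json.loads(text)  # malformed JSON raises a ValueError subclass
    if not isinstance(spec, dict) or set(spec) != ALLOWED_KEYS:
        raise ValueError("spec must contain exactly: title, points")
    if not isinstance(spec["title"], str):
        raise ValueError("title must be a string")
    points = spec["points"]
    if not isinstance(points, list) or len(points) < 2:
        raise ValueError("points must be a list with at least two entries")
    for p in points:
        if not (isinstance(p, list) and len(p) == 2 and all(map(is_number, p))):
            raise ValueError("each point must be [x, y] with numeric values")
    return spec


def render_svg(spec: dict, size: int = 200, margin: int = 20) -> str:
    """Deterministically map data coordinates to pixels and emit SVG."""
    points = spec["points"]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    span_x = (max(xs) - x_min) or 1  # avoid dividing by zero on flat data
    span_y = (max(ys) - y_min) or 1
    inner = size - 2 * margin

    def to_px(x, y):
        # SVG's y axis points down, so the data's y axis is flipped.
        return (margin + (x - x_min) / span_x * inner,
                size - margin - (y - y_min) / span_y * inner)

    coords = " ".join("{:.1f},{:.1f}".format(*to_px(x, y)) for x, y in points)
    title = html.escape(spec["title"])  # escape text that came from the model
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<text x="{margin}" y="{margin - 5}">{title}</text>'
        f'<polyline points="{coords}" fill="none" stroke="black"/>'
        "</svg>"
    )


svg = render_svg(parse_spec(stub_llm_response()))
print(svg)
```

The model's only job in this sketch is to produce the numbers and the title. Everything after `parse_spec` is ordinary deterministic code, which is why the plotted line matches the data exactly. The cost, latency, and security concerns mentioned in item 1 arise when steps like these have to run for every request at large scale.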
[1] excerpt prepared with the support of Claude, Anthropic
Image by Brian Penny from Pixabay
References
Wang, J., Ming, Y., Shi, Z., Vineet, V., Wang, X., & Joshi, N. (2024). Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models. ArXiv, abs/2406.14852.
Bosheah, Z.; Bilicki, V. Challenges in Generating Accurate Text in Images: A Benchmark for Text-to-Image Models on Specialized Content. Appl. Sci. 2025, 15, 2274. https://doi.org/10.3390/app15052274
Zhang, C., Zhang, C., Zhang, M., & Kweon, I. (2023). Text-to-image Diffusion Models in Generative AI: A Survey. ArXiv, abs/2303.07909.
Kou, S., Jin, J., Zhou, Z., Ma, Y., Wang, Y., Chen, Q., Jiang, P., Yang, X., Zhu, J., Yu, K., & Deng, Z. (2026). Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders. ArXiv, abs/2601.10332.
