Limits of Current LLM Technology, or Why We Are Likely in a Large Investment Bubble

2026-03-15

Misplaced Expectations and the Limits of LLMs

Over the past three years, LLMs have gained enormous popularity and were brought into widespread public awareness largely through the efforts of OpenAI. Large Language Models are an impressive and highly practical technology. In many everyday situations, they help simplify and accelerate work. Personally, I use LLMs as a kind of “next generation search,” partially replacing traditional Google searches.

During the last three years, many organisations and companies have begun searching for ever broader application areas. In some cases there are ambitions, and in others even concrete plans, for AI, and LLMs in particular, to autonomously handle entire business processes. The expectation is that this could make a significant portion of the workforce redundant and thereby reduce costs, for example through the use of LLM-based agents. Companies are currently investing enormous amounts of money in AI technology, particularly in systems based on LLMs. I am sceptical of this vision and do not believe that current LLM technology is a reliable means of meeting these expectations. In the following sections, I outline the reasoning behind this view.

One of the central arguments against using LLMs to automatically and autonomously handle business processes is the error rate and the inaccuracy of the information they produce. Based on my personal experience, the everyday use of an LLM often results in error rates of roughly 30 to 60 percent, depending on the application domain and how prompts are formulated. These errors range from minor inaccuracies to clearly incorrect statements. Various studies also show that errors in this range can occur in practice [1][2].

This error rate is therefore not merely the result of poorly written prompts. It also follows from technical limitations inherent to LLMs themselves. The following arguments lead me to conclude that LLMs are currently not suitable for performing autonomous operational tasks.

Learned Knowledge Is Limited, Unplanned Situations Are Not

An LLM does not possess a reliable world model from which it can safely derive rules. Instead, it generates text that statistically fits the given input. As long as a question appears in a form similar to what was present in the training data or the current context, the results can appear remarkably convincing [3]. However, when a situation is genuinely new or when important details are missing, the system is left with little more than producing a “plausible” guess.

This is precisely the point that cannot be ignored in business processes. In accounting, medicine, law, compliance, risk management, or operational workflows, “plausible” is not sufficient. A single serious error can cause more damage than a hundred correct answers provide value. For this reason, even a 5 percent error rate is often unacceptable, and 20 percent becomes practically unusable.

Context Size Forces Fragmentation, and Transitions Create Errors

The second major limitation is the context size [4][5]. As soon as a task exceeds the available context window, the only practical option is to split it. Documents are broken into sections, tasks are divided into subtasks, intermediate results are summarized, and these summaries are later fed back as input.

This process introduces several typical error sources:

  • Global dependencies are lost, such as definitions, exceptions, cross-references, or boundary conditions.
  • Results from context window A and context window B can only be combined with limited consistency.
  • Inaccuracies often occur precisely at the transition points, where the overall logic of the problem would be required.

If an LLM “reads” a 500-page documentation in ten separate parts, it lacks the relationships across the document as a whole. One can attempt to compensate for this by creating summaries, but summaries inevitably lose detail. Alternatively, one can include more and more text in the context, but this quickly becomes expensive and slow.
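The fragmentation problem can be illustrated with a toy pipeline. This is a deliberately minimal sketch: `summarize` is a stand-in for an LLM call and simply keeps the first sentence of each chunk, which is enough to show how a definition needed later can be dropped at a chunk boundary. The sentences and chunk size are invented for illustration.

```python
# Toy chunk-and-summarize pipeline. `summarize` stands in for an LLM
# call; keeping only the first sentence per chunk mimics the lossiness
# of real summaries.

def chunk_sentences(sentences, per_chunk):
    """Split a list of sentences into fixed-size 'context windows'."""
    return [sentences[i:i + per_chunk] for i in range(0, len(sentences), per_chunk)]

def summarize(chunk):
    """Placeholder for an LLM summary: keeps only the first sentence."""
    return chunk[0]

sentences = [
    "Background on the reporting process.",
    "Section 2 defines 'net revenue' as revenue minus returns.",
    "Section 3 lists the reporting deadlines.",
    "Per Section 2, report net revenue for Q3.",
]

chunks = chunk_sentences(sentences, per_chunk=2)
summaries = [summarize(c) for c in chunks]
merged = " ".join(summaries)

# The definition from chunk 1 was summarized away, so the cross-reference
# "Per Section 2" in chunk 2 now points at nothing the merged view contains.
print(merged)
```

The definition survives in the full document but not in the merged summaries, which is exactly the kind of lost global dependency described above.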

“Just Use a Larger Context Window” Is Not a Working Solution

A common suggestion is to massively increase the context window so that large tasks can be processed in a single pass. This can help in certain situations, but it is not a simple scaling effect in the sense of “twice the context equals half the errors.”

There are two main reasons for this.

a) With very long contexts, information in the middle becomes less usable

LLMs often make better use of information located at the beginning or the end of a long context than information placed somewhere in the middle [6]. In practice this means that even if all relevant information is technically present in the context, it does not guarantee that the model will reliably use it. More context therefore does not automatically translate into fewer errors.

b) Resource requirements increase sharply and can constrain operations

Long contexts require very large amounts of memory, particularly for the KV cache, and they reduce throughput [7]. The time to first output increases because the model must first process a very large input sequence. At very large context sizes, a practical issue emerges: a single request can occupy so many resources that the remaining infrastructure can barely operate in parallel [8]. This is not only expensive but also difficult to scale.
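The KV-cache pressure mentioned above can be estimated with a simple back-of-the-envelope formula. The parameters below are illustrative assumptions loosely modeled on a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, 16-bit precision); real architectures vary, so treat the figures as orders of magnitude only.

```python
# Rough KV-cache size estimate for long contexts (illustrative numbers).

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # 2 bytes = fp16/bf16
    # Keys and values are both cached, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

for tokens in (8_000, 128_000, 1_200_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> ~{gib:,.1f} GiB of KV cache")
```

Under these assumptions, a single 128k-token request already needs on the order of 40 GiB of KV cache, and a 1.2M-token request several hundred GiB, which is why one long request can crowd out everything else on a node.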

As a result, a fundamental trade-off emerges. Some transition errors may be reduced, but this comes at the cost of significantly higher operating expenses, higher latency, and no guaranteed improvement in reliability.

Even With More Context and Training, a Residual Error Rate Remains

My expectation is that even with substantial effort, a residual error rate of around 20 percent may remain. This is not because 20 percent is a special number, but because the underlying principle does not change: the system produces probabilistic outputs, not deterministic proofs. This becomes particularly visible when the model encounters situations that are not covered by its training data or the provided context. In those cases, the system lacks a reliable basis and the output becomes closer to an informed guess than a well-grounded conclusion.

This leads to the economically critical point. Many companies do not need systems that are “usually correct”; they need systems that are “almost always correct.” Moving from 80 percent reliability to 99 percent reliability is not simply a matter of adding more hardware. In practice, reducing the remaining errors becomes increasingly expensive the closer one moves toward very high reliability. The remaining errors typically involve the most difficult cases: exceptions, ambiguous situations, novel scenarios, contradictory information, or rare boundary conditions.

Costs Rise Rapidly While Benefits Do Not Scale at the Same Rate

This leads to the core economic argument. From today’s perspective, many of the current investments appear excessive because the required computational resources often cannot or will not be provided at the necessary scale in real operational environments. At the same time, even very high resource investments do not reduce the error rate to levels that many business problems require.

This pattern becomes visible when looking at approximate cost estimates. The following figures represent rough estimates of the Total Cost of Ownership (TCO) associated with different context window sizes over the lifetime of the required hardware. The calculation method is described in the appendix (values in USD).

  • 128,000 token context: TCO roughly $0.56 million per hardware cycle
  • 1.2 million token context: TCO roughly $2.23 million
  • 3.6 million token context: TCO roughly $4.64 million
  • 7.2 million token context: TCO roughly $9.82 million

The problem is not only the absolute cost but also the efficiency. As the context window grows, a single request consumes increasingly large amounts of memory and inter-process communication bandwidth. As a result, the parallel throughput per invested dollar declines significantly. When throughput declines, the cost per usable response increases.

This leads to a clear conclusion: even if error rates decrease somewhat with additional resources, they remain relatively high while costs grow very rapidly. The resulting increase in cost can no longer be justified by the incremental gains in performance.

“RAG Will Fix It” and “Specialization Will Fix It” Are Only Partial Answers

Two common counterarguments deserve serious consideration. However, neither of them resolves the underlying problem.

a) RAG (Retrieval from a Knowledge Base)

Retrieval-Augmented Generation (RAG) retrieves relevant passages from a database. This can make information lookup more efficient and reduce the amount of context that needs to be included in the prompt. However, RAG also works with fragments. In practice, the system typically selects passages that are semantically closest to the current query, not necessarily all passages that are relevant to the overall problem.

Moreover, the model’s context window remains limited even when using RAG. If a problem requires drawing conclusions across many documents—for example where exceptions, special cases, or contradictory clauses exist—RAG can easily return the “wrong right” passages: text that appears relevant to the question but omits the crucial details located elsewhere.

RAG therefore mainly solves a storage and search problem. It does not solve the problem of producing a reliable synthesis across large and complex bodies of information.

Fundamentally, RAG cannot place the entire problem into the model's context, because the context window itself remains limited.

b) Specialized Smaller Models

Small, highly specialized models can indeed be cheaper and more precise for clearly defined tasks. This is particularly true for standardized and well-bounded applications. However, they are far less helpful when dealing with unforeseen situations. In those cases, what is required is transfer ability—the capability to apply insights from known situations to new and previously unseen ones.

Doing so would require reliably applying rules to new cases, drawing clear distinctions, and evaluating causal relationships. These are precisely the capabilities that current LLM systems handle only with limited reliability.

The Key Practical Point: The Difficult Cases Remain

My hypothesis is that a relatively small share of cases causes the majority of the effort, because these cases are not predictable or do not fit neatly into standardized workflows. And these are exactly the cases that current LLM systems cannot reliably handle.

This explains why many AI initiatives in practice do not deliver the efficiency gains that were initially expected:

  • The simple tasks become automated or partially automated.
  • The difficult cases still require resolution by human experts.
  • Additional work arises for verification, correction, liability management, and auditability.

As a result, the net benefit shrinks. In some domains it may even become negative, because while the system produces output, organizations must invest significant effort to verify whether the output is correct.

In addition, LLM-based and agent-based systems introduce considerable technical complexity alongside their high resource requirements. This raises a practical consideration: in many situations it may be simpler and more robust to handle standardized processes with conventional, well-tested workflow systems, especially when complex cases still require human intervention anyway.

Conclusion: Technical Limits and the Cost Curve Do Not Match Expectations

I draw the following conclusions from the considerations above:

  • LLMs are a solid technology that can assist people in many tasks. However, they are not creative in the strict sense and often respond only imperfectly to new or unforeseen situations.
  • Today, LLMs frequently produce error rates that are unacceptable in many business processes.
  • The underlying causes are limited learned knowledge, limited context, and the fact that the system generates answers even when no reliable basis is available.
  • Larger context windows and additional compute are not a clean solution, because costs and scalability quickly become limiting factors while quality does not improve proportionally.
  • Tools such as RAG or highly specialized models can help, but they do not solve the fundamental problem of difficult and unpredictable cases.
  • Nevertheless, enormous investments are currently flowing into this area, even though the operational benefits of many applications remain below expectations. In my view, current investments are therefore often too large and inefficient.
  • The only way these investments could be justified in the long term would be the emergence of a new generation of technologies that are significantly more reliable than today’s LLMs. At present, however, there are no clear public indications that such a breakthrough is imminent.
  • As a result, the behavior of capital markets shows a pattern typical of a bubble: high expectations, massive investments, and so far too few reliable results in real-world applications.

Once this discrepancy becomes visible in corporate financial results, the market will begin to select. Many projects will likely be abandoned or significantly scaled back. What will remain are applications with clearly defined tasks, controlled environments, manageable context sizes, and architectures that do not attempt to force reliability simply by adding more hardware.

Appendix – Estimated TCO for Context Window Size

The energy consumption required to process a context window of 128,000 tokens depends strongly on the model used (especially the number of parameters) and the hardware architecture. However, the order of magnitude can be roughly estimated as follows.

1. Rule of Thumb per Request

A commonly cited estimate for modern LLMs (in the GPT-4 class) is roughly 0.01 to 0.1 watt-hours (Wh) per 1,000 tokens.

For 128,000 tokens, this corresponds to roughly 1.3 to 13 Wh per full pass.

For comparison, 13 Wh is roughly the amount of electricity required to fully charge a modern smartphone once, or to run a 10-watt LED lamp for about one hour.
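Applying the rule of thumb above explicitly, and assuming only the quoted 0.01 to 0.1 Wh per 1,000 tokens range:

```python
# Back-of-the-envelope energy for one full 128k-token pass.
tokens = 128_000
low, high = 0.01, 0.1  # Wh per 1,000 tokens (rough rule of thumb)

wh_low = tokens / 1_000 * low    # ~1.3 Wh
wh_high = tokens / 1_000 * high  # ~12.8 Wh

print(f"~{wh_low:.1f} to {wh_high:.0f} Wh per full pass")
```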

2. Input vs. Output Energy Consumption

Energy consumption is not evenly distributed between input processing and output generation.

Prefilling (input stage). Reading the 128,000 tokens into the model is computationally intensive, but modern GPUs such as the NVIDIA H100 can process these operations highly in parallel, making the energy cost per token relatively efficient.

Generation (output stage). Generating new tokens once a 128k context is already present is significantly more expensive. For each generated token, the entire KV cache (key-value memory) must remain resident and addressable in GPU memory (VRAM). This increases both memory traffic and computational overhead.

3. Hardware Impact

Data center environment (H100 GPUs). A single H100 GPU consumes roughly 350 to 700 watts under load. Large models with long contexts often require multiple GPUs, meaning that a full server system can easily reach 4,000 to 6,000 watts of total system power.

Local systems (e.g., Apple silicon workstations). Machines such as the Apple Mac Studio with M2 Ultra use shared-memory architectures that can sometimes handle long contexts more efficiently than traditional PC architectures. However, under full load they also consume substantially more power than in idle operation.

Cost Calculation Method

The costs of large-context inference are primarily driven by memory bandwidth requirements and VRAM capacity. In large providers such as OpenAI, development costs are distributed across billions of requests, while operational costs scale roughly linearly with usage.

The following simplified calculation estimates the Total Cost of Ownership (TCO) for a GPT-4-class system over a three-year hardware lifecycle.

1. Development Costs (Allocated Share)

Training and developing a state-of-the-art LLM (including training infrastructure, datasets, and expert salaries) can cost roughly $100 million.

For a TCO estimate of a continuously utilized system, these costs can be allocated across compute nodes. If roughly $100 million in development cost is amortized over a serving fleet of about 670 such nodes (roughly 5,400 H100 GPUs), the share allocated to a single 8-GPU node is roughly $150,000 in proportional R&D cost.

2. Hardware Acquisition (Capex)

Serving large context windows efficiently typically requires systems such as an NVIDIA HGX H100 server with 8 GPUs.

Estimated costs:

Hardware acquisition: $300,000

Infrastructure (networking, racks, storage): $50,000

3. Operating Costs (Opex over 3 Years)

Assuming continuous full utilization (24/7) over three years:

Power consumption

An H100 node consumes roughly 10 kW including cooling (PUE).

10 kW × 24 h × 365 days × 3 years = 262,800 kWh

At an industrial electricity price of approximately $0.10 per kWh, this results in roughly:

$26,000 in electricity and cooling costs.
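The electricity figure can be reproduced directly from the stated assumptions (10 kW node draw including cooling, continuous 24/7 operation over three years, $0.10 per kWh industrial rate):

```python
# Electricity and cooling cost over the three-year hardware lifecycle.
node_kw = 10                     # node power draw incl. cooling (PUE)
hours = 24 * 365 * 3             # continuous operation over 3 years
price_per_kwh = 0.10             # industrial electricity price (USD)

kwh = node_kw * hours            # 262,800 kWh
cost = kwh * price_per_kwh       # ~ $26,280

print(f"{kwh:,} kWh -> ${cost:,.0f}")
```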

Maintenance and hosting

Data center rental, infrastructure management, and operational staff:

≈ $30,000 over three years.

The “128k Context” Throughput Effect

A 128k context window significantly reduces throughput (tokens per second). While short contexts allow thousands of concurrent users per node, an H100 node operating with a 128k context often supports only a small number of parallel inference streams at acceptable latency.

As a result, the cost per generated token can increase by a factor of roughly 10 to 50 compared with short-context workloads.

Estimated TCO per 8-GPU Node (3-Year Lifecycle)

  Cost component                 Estimate (USD)
  Hardware (Capex)               $350,000
  Allocated development (R&D)    $150,000
  Electricity & cooling          $26,000
  Maintenance & operations       $30,000
  Total TCO                      ≈ $556,000

Economic Interpretation

If such a node continuously processes 128k-context workloads, it may generate roughly 50–100 billion tokens over three years under full utilization.

This implies a cost of roughly $5–11 per million tokens, depending on utilization efficiency.
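The implied token economics follow directly from the two estimates above (≈$556,000 TCO and 50–100 billion lifetime tokens):

```python
# Implied cost per million tokens from the TCO and lifetime token volume.
tco = 556_000  # total cost of ownership per node, USD

for tokens_billion in (50, 100):
    millions_of_tokens = tokens_billion * 1_000
    per_million = tco / millions_of_tokens
    print(f"{tokens_billion}B lifetime tokens -> ${per_million:.2f} per million tokens")
```

This reproduces the roughly $5–11 per million tokens stated above; lower utilization pushes the figure toward the high end.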

TCO Estimates for Larger Context Windows

Applying the same methodology to larger context sizes leads to the following approximate values:

  Context size            Hardware (Capex)   Development share   Electricity & cooling   Maintenance & operations   Total TCO
  128k (1 node)           $350,000           $150,000            $26,000                 $30,000                    $556,000
  1.2M (4 nodes)          $1,600,000         $400,000            $105,000                $120,000                   $2,225,000
  3.6M (8-node cluster)   $3,200,000         $1,000,000          $236,500                $200,000                   $4,636,500
  7.2M tokens             $6,400,000         $2,500,000          $473,040                $450,000                   $9,823,040

These estimates illustrate how costs increase rapidly with larger context windows, while throughput decreases due to higher memory pressure and reduced parallelism.
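As a sanity check, the Total TCO column in the table above is simply the sum of its four cost components:

```python
# Verify that each Total TCO equals the sum of its components (USD).
rows = {
    "128k":  (350_000,   150_000,   26_000,  30_000),
    "1.2M":  (1_600_000, 400_000,   105_000, 120_000),
    "3.6M":  (3_200_000, 1_000_000, 236_500, 200_000),
    "7.2M":  (6_400_000, 2_500_000, 473_040, 450_000),
}

totals = {size: sum(components) for size, components in rows.items()}
print(totals)
```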

References

  1. Stephanie Lin, Jacob Hilton, Owain Evans, 2022: TruthfulQA: Measuring How Models Mimic Human Falsehoods. Accessed 2026-03-15: https://arxiv.org/abs/2109.07958 ↩︎

  2. Anila Jaleel, Umair Aziz, Ghulam Farid, Muhammad Zahid Bashir, Tehmasp Rehman Mirza, Syed Mohammad Khizar Abbas, Shiraz Aslam, Rana Muhammad Hassaan Sikander, 2025: Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. Accessed 2026-03-15: https://pmc.ncbi.nlm.nih.gov/articles/PMC12495368/ ↩︎

  3. Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell, 2021: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Accessed 2026-03-15: https://dl.acm.org/doi/10.1145/3442188.3445922 ↩︎

  4. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei, 2020: Language Models are Few-Shot Learners. Accessed 2026-03-15: https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html ↩︎

  5. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, Illia Polosukhin, 2017: Attention Is All You Need. Accessed 2026-03-15: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html ↩︎

  6. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2024: Lost in the Middle: How Language Models Use Long Contexts. Accessed 2026-03-15: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/Lost-in-the-Middle-How-Language-Models-Use-Long ↩︎

  7. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023: Efficient Memory Management for Large Language Model Serving with PagedAttention. Accessed 2026-03-15: https://arxiv.org/abs/2309.06180 ↩︎

  8. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2023: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Accessed 2026-03-15: https://openreview.net/pdf?id=H4DqfPSibmx ↩︎
