
From Toy to Tool: Everything Looks Broken to Start

Generative AI is, in roughly equal parts, hailed as the solution to all our problems and derided as complete junk. Some of this discourse comes down to vested interests and natural counter-reactions to hyperbolic claims on both sides, but it is also emblematic of the over-excitement and misunderstanding that accompany any new technology.

How Paradigm Shifts Often Begin as “Toys”

When a transformative technology first emerges, it’s easy to underestimate it. Initial iterations often feel like toys: curious but impractical novelties that can’t compete with the status quo. Given the right environment and steady investment, these toys can quickly become powerful agents of change, both through their own rapid improvement and through the way their unique attributes end up reshaping demand itself.

From Simple Chatbots to Powerful Multi-modal Systems

In the GPT-2 era, chatbots could produce passable snippets of text: enough to demonstrate some potential for language manipulation, but of limited practical utility. GPT-3 pushed this further, producing cod poetry and short coherent narratives that, while impressive, still felt limited; the repetitive and unnatural quality of the language was quick to discern.

With GPT-4, we witnessed a clear change. Tasks previously considered complex, such as generating coherent multi-step instructions, producing detailed content outlines, and even some problem solving, were suddenly within grasp, and finding the boundaries of its limitations started to require a bit more work.

What has been truly remarkable is how rapidly it has become possible to add other modalities. Starting with Gemini, we gained a model capable of taking in images and audio alongside text, and suddenly a computer could create meaningfully useful descriptions of a scene, capturing not just the entities present but their relationships and composition.

We’ve seen a similar rapid progression in coding tasks. Early AI models could write a simple function or a few lines of code, but longer-form output was unreliable and prone to hallucinations; they were useful for creating tests or helping with documentation, but little more. This expanded to writing longer scripts and automating common tasks, and is now at the point where it is reliable enough for many needs. The current frontier is generating more full-featured applications, and while this currently has many flaws, it does not seem insurmountable; based on prior experience, it is likely to become stable enough for a range of general needs over the next 18 months.

What started as a toy is, with each iteration, gaining more capability, and as our understanding of its strengths and weaknesses grows, so does our ability to use it for useful work.

The Power of Recombining Simple Ideas

Why do these paradigm shifts so often begin as something deceptively small or unimpressive? Often the breakthrough lies not in entirely new ideas but in recombining simpler concepts in new environmental contexts.

A relatively recent example of this in computing is the shift from vertically scaled mainframes, powerful but costly and singular, to datacenters filled with horizontally scaled, inexpensive commodity PCs. To outsiders and the mainstream, this approach initially seemed flawed and born of destitution: the PCs were individually unreliable and less powerful, so surely the only reason to use them was a lack of capital to buy proper machines. To incumbents invested deeply in mainframe optimization, commodity PCs appeared ill-suited to meet enterprise-grade demands.

What these incumbents missed was the broader shift in the environment. The metric of reliability per individual machine became less critical once distributed systems, redundancy, and horizontal scaling became feasible. The market conditions and technological advancements changed the fundamental trade-offs and opened up the way for warehouse-scale computing.

AI and the Reliability Misconception

Today, a similar scenario is unfolding with AI-generated code. Critics emphasize AI’s tendency to produce large volumes of unreliable code quickly, and from their perspective this seems counterproductive. Yet it’s essential to ask: given the properties of this horizontally scalable, high-output environment, what previously unattainable possibilities now open up? In other words, yes, we may currently be regressing on one dimension, but what are we gaining on others?

In the short term, expect to see niches open up where the falling cost of bespoke software means a lot more software gets created to solve small, specific needs. Much of this will look hacky and will be derided as “not serious software”, but for those using it, it will be just as real as a piece of high-quality, production-grade software if it solves their needs. Not all needs are constant; some are very transient. Until now, the barrier to creating software has tended to focus it where there is a need for ongoing maintenance and investment.

Over time, though, it will expand out from this base, and rather than narrowly focusing on the reliability of a single code snippet, we should consider the broader system-level perspective. What becomes achievable when you have a tool capable of rapidly iterating over hundreds or thousands of code variations almost instantaneously? How might rapid, iterative improvement, paired with automated testing and verification, fundamentally change software development workflows?
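To make that less abstract, here is a minimal sketch of one such workflow: generate many candidate implementations, run each against an automated test suite, and keep only the ones that pass. Everything here is hypothetical: the generate_candidates stub simulates whatever code-generation model you might call, and the slugify task and its test cases are invented purely for illustration.

```python
from typing import Callable, List

# Stand-in for a code-generation model: in reality this would call whatever
# API you use and return n candidate implementations. The two hard-coded
# candidates below (one broken, one working) simply simulate that output.
def generate_candidates(prompt: str, n: int) -> List[str]:
    broken = "def slugify(text):\n    return text.lower()\n"
    working = (
        "import re\n"
        "def slugify(text):\n"
        "    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')\n"
    )
    return [broken, working] * (n // 2)

# A small automated test suite acting as the verifier; any candidate that
# fails a single case is discarded.
TEST_CASES = [
    ("Hello, World!", "hello-world"),
    ("  spaces   everywhere ", "spaces-everywhere"),
]

def passes_tests(func: Callable[[str], str]) -> bool:
    try:
        return all(func(inp) == expected for inp, expected in TEST_CASES)
    except Exception:
        return False

def select_working_implementations(prompt: str, n: int = 100) -> List[Callable[[str], str]]:
    """Generate n candidates and keep only those that pass the test suite."""
    survivors = []
    for source in generate_candidates(prompt, n):
        namespace: dict = {}
        try:
            exec(source, namespace)  # turn the candidate text into a callable
        except Exception:
            continue  # individual unreliability is expected; just move on
        candidate = namespace.get("slugify")
        if callable(candidate) and passes_tests(candidate):
            survivors.append(candidate)
    return survivors

if __name__ == "__main__":
    good = select_working_implementations("write a slugify function", n=10)
    print(f"{len(good)} of 10 candidates passed the tests")
```

The point is not the toy example itself but the shape of the system: the reliability of any single candidate matters far less once generation is cheap and verification is automated.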

What Would Need to Be True?

To harness this paradigm effectively, we must ask:

  • How can we leverage rapid iteration to offset individual unreliability?
  • What infrastructure or processes are necessary to mitigate inherent risks?
  • How might continuous integration, automated testing, and robust error-handling strategies shift in this context?

There is a thought experiment that asks, “Would you rather fight one horse-sized duck or 100 duck-sized horses?” In practice, the answer comes down to where your strategic advantage lies. Sometimes managing many small, individually manageable units (duck-sized horses) is far easier than taking on a single formidable opponent (a horse-sized duck).

Embracing the Shift

Paradigm shifts rarely look like improvements by traditional metrics at their onset. They’re often dismissed as toys and experimental curiosities, with incumbents fixating on how the new thing fails by the old standards: the unreliability of a single commodity PC, or the hallucinations in a single snippet of AI-generated code. This is the wrong yardstick.

The more important question is what new possibilities are unlocked by the unique properties of this new regime. The challenge, then, is not perfecting the individual component but mastering the new system it enables. It is about learning to harness a hundred duck-sized horses while everyone else is still debating how to fight the single horse-sized duck.