The Last Mile
The last leg is often the hardest: analogies to the deployment of AI-enabled software.
Some lessons you learn over and over, not because you don’t adjust, but because adjusting to a sufficient degree is painful. “The Last Mile” is the first post on The Transducer, a Substack on AI startups and products. It covers a lesson I’ve had to learn repeatedly: when and why “lean startup” methodologies can backfire when developing AI products, drawing examples from self-driving cars, speech recognition, and, of course, chatbots.
While struggling through to the end of Grandma’s Marathon this past June, I wondered whether, in spite of careful fueling and patient pacing, the last miles would always mean leaden legs and prolonged pain.
The Problem
As with many eventually hard things, the first stages of a marathon should feel easy; it's only well into the latter half that the slog should begin. The last leg is the most excruciating.
This phenomenon can be seen in many walks of life and industries. In transportation and logistics, the last mile problem appears in going from hubs to the final, interspersed destinations. In telecommunications, the final traversal from trunk lines to the leaves of individual residences constrains network bandwidth and dominates cost. It is the law of diminishing returns, the 80/20 principle, and so on.
In AI, the last mile problem often lies in crossing some threshold of reliability.
Assuming collecting data isn’t prohibitively expensive, it’s easy to get to 90% accuracy on many tasks; harder to get to 95%; much harder still to get to 99%; and often, from there, unknowably more difficult to iterate towards additional 9s.
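To put rough numbers on those nines, here is a minimal sketch; the daily query volume and the accuracy tiers are assumed figures for illustration:

```python
# Failures seen at scale for each accuracy tier; the query volume is
# an assumed figure for illustration, not a measurement.
DAILY_QUERIES = 100_000

for accuracy in (0.90, 0.95, 0.99, 0.999, 0.9999):
    failures = DAILY_QUERIES * (1 - accuracy)
    print(f"{accuracy:.2%} accurate -> ~{failures:,.0f} failed queries/day")
```

Each additional nine cuts the failure count by an order of magnitude, yet each typically costs more effort to reach than the last. A few examples: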
Self-Driving Cars
The general release of Waymo this June in San Francisco was a muted milestone in the history of autonomous vehicles. The dream of a self-driving vehicle trained using image data is over 30 years old. The wide, sunny roads of Arizona were useful early testing grounds for many companies, as were busy California highways. But systems also need to handle difficult urban settings: pedestrians, bikers, construction sites that change by the day. Many of the sensors used to detect the road and obstacles are not impervious to rain and snow. Commonly traveled paths have been mapped using LIDAR to the centimeter; yet the full deployment of autonomous vehicles requires the ability to navigate unpaved, never-before-seen roads, and driver takeover and teleoperation remain fallbacks.
Speech Recognition
Before the incarnation of Samantha from Her, speech assistants were relegated to the simple lives of playing music on Spotify and setting timers. As with many advances in computing, the origins of speech recognition extend back at least 72 years to Bell Labs. At first, only limited lexicons were supported—for example, numbers only, reminiscent of early systems used to read digits on checks—but then ambitions shifted to conversational transcription on benchmarks such as Switchboard. But noisy and public settings, accents, accidental awakenings, and the discoverability of supported commands—many hurdles still lie between the natural interface of spoken words and widespread adoption.
Language Model Hallucinations
“I apologize for the confusion.”—a common refrain, often following a mistake from LLMs with limited context and memory trained on the vast but unreliable web. Despite retrieval-based methods for gathering supporting documents and troves of examples illustrating how to walk the tightrope of being helpful but not harmful, LLMs continue to hallucinate at alarming rates. Some have argued that for certain use cases, this may be a feature, not a bug. Yet, for the prototypical AI assistant intended to answer queries or help perform menial white-collar tasks, reliability is the missing leg.
Consequences
Often, the last mile problem results in demos that never become adopted products. A demo that works 80% of the time is more than enough to impress an engineer’s boss, whose boss in turn can now include it in her keynote. But failure 20% of the time is enough for a user to stop using a feature altogether: over a ten-interaction session, the chance of hitting at least one failure is 1 - 0.8^10, or roughly 89%.
This matters less if the product performs a general information-retrieval task, like the search or QA that many of the recent chatbot LLMs provide. Mistakes are expected, and the set of results often contains some useful pieces regardless.
For a business inquiry, however, reliability and consistency are the expectation. Systems of record, including analytics platforms, are often just UI wrappers around a database where data integrity is the first-order concern. For systems of action (agents), the consequences of said actions outside of a small sandbox environment require the utmost trust (the subject of a future post). Backing off from autonomous agents: for tightly integrated AI systems such as email assistants, long-term memory and an understanding of intent and context are critical for suggestions that don’t miss the mark or require excessive revision.
But accuracy and reliability are just one dimension; consider:
Latency, throughput, and cost: While the pace of improvement here has been astounding, going from inquiry to watching a stream of tokens, reading the response, and often sending a follow-up chat inquiry accumulates into significant delay (see the sketch after this list). In contrast, web views such as Google Search allow for a keyword-based search, then a transition into clicking and scrolling.
Quality and expertise: While commercial LLMs have consumed nearly every publicly available scrap of text online, most of that text is noisy and of low quality. Researchers have demonstrated that the provenance of the data and the selection of high-quality text are critical to LLM performance. Many applications benefit from finetuning, bringing the model closer, step by step, to some pluralistic ideal.
Guardrails and safety: With the reputations of businesses hinging on the potentially unhinged behavior of user-facing LLMs, a long process of instruction finetuning is required to take a model that performs well on benchmarks to one that can politely, but unwaveringly, navigate the pleas and exploits of malicious users.
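To make the latency point concrete, here is a back-of-the-envelope sketch; every figure in it (time to first token, decode rate, response length, reading speed) is an assumption for illustration, not a benchmark of any particular provider:

```python
# Back-of-the-envelope latency for a two-turn chat interaction; all
# figures are illustrative assumptions.
TIME_TO_FIRST_TOKEN_S = 0.5   # network round trip + prefill
TOKENS_PER_SECOND = 50.0      # streaming decode rate
RESPONSE_TOKENS = 400         # a few paragraphs per answer
WORDS_PER_TOKEN = 0.75
READING_WPM = 250             # user reading speed

def turn_seconds(tokens: int) -> float:
    streaming = TIME_TO_FIRST_TOKEN_S + tokens / TOKENS_PER_SECOND
    reading = tokens * WORDS_PER_TOKEN / (READING_WPM / 60)
    return streaming + reading

# An initial answer plus one follow-up inquiry:
total = 2 * turn_seconds(RESPONSE_TOKENS)
print(f"~{total / 60:.1f} minutes across two turns")  # ~2.7 minutes
```

A keyword search, by contrast, returns a scannable page of results within a second or two.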
These are all part of the difficult last mile that must be covered to put a model into production.1 And even then, a single model may be just a cog in a larger system; these issues compound once we start piecing together components and modules (the subject of an upcoming post).
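As a hypothetical sketch of that compounding, chain together a handful of components that each look impressive in isolation, and the end-to-end success rate erodes quickly (the per-component rates below are invented):

```python
import math

# Invented per-component success rates for a hypothetical pipeline;
# none of these reflect a real system.
pipeline = {
    "speech recognition": 0.97,
    "intent parsing": 0.98,
    "retrieval": 0.95,
    "generation": 0.96,
    "action execution": 0.99,
}

end_to_end = math.prod(pipeline.values())
print(f"end-to-end success rate: {end_to_end:.1%}")  # ~85.8%
```

Five components in the mid-to-high nineties already leave the overall system below 90% end to end.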
Future Projections
While many AI-enabled functions are still stuck in this last mile, what will be possible in the next year, or even three years down the line? This is what makes things trickiest—building towards what will be possible instead of what is.2
In an interview with Microsoft CTO Kevin Scott, Ben Thompson suggests that “tech has forgotten what it was like to build on top of Moore’s Law”—inefficiencies would be taken care of in one to two upgrade cycles of Intel’s chips, and hence building on the frontier of compute capacities was critical.
Given the breakneck pace of language model development—racks of GPUs stretching as far as the eye can see, a trillion-dollar cluster—the equivalent may be to build products that demand intelligence well above existing capabilities.
Perhaps, then, use cases where high levels of accuracy are not expected are but quick layovers. Perhaps what appears to be vaporware is, given only six to twelve months’ time—and coinciding with an announcement from San Francisco—condensing into a breakthrough product.
Solutions & Workarounds
With that enormous caveat, and setting differentiation aside, we believe that careful problem selection (conditioned on a given timeframe) and simplification of the problem are especially critical for AI-enabled products.3 Returning to the prior examples:
Self-Driving Cars
Self-driving vehicles are a product with little market risk but gargantuan execution risk; nevertheless, both the market segment and the path to production have seen many approaches.
Beyond the largest market of robotaxis as a replacement for the commuter’s daily driver, self-driving companies have been formed (and dissolved) for autonomous trucking (TuSimple, Embark, Waabi), delivery vehicles (Starship, Nuro), and add-ons to vehicles from OEMs (MobilEye). Each specialization yields simplifications: trucks mostly drive on freeways; small delivery vehicles can move slowly and are unlikely to cause fatal accidents; add-ons to existing vehicle platforms avoid a brutal manufacturing supply chain. In more than one conversation, peers have joked about autonomous golf carts as a favorable initial market, with low stakes and a clientele at leisure.
Technical approaches have ranged from stereo cameras, to additional cameras and radar, to vehicles distinguished by spinning LIDAR cylinders. At the extremes of the spectrum sit two opposing denominations: one represented by the camera-only sect of Tesla, and the other by the sensor-rich sect of Waymo. Which is the better approach remains up for debate: (1) to avoid the problem of multimodal sensor fusion and planning—using large fleets of consumer vehicles to train an end-to-end system that goes from camera pixels to driver actions, or (2) to avoid the problem of ever-imperfect perception—using powerful sensors to robustly detect objects, despite a costlier journey to collect and evaluate data.
Speech Recognition
As previously mentioned, quiet settings and fluent speakers simplify the speech recognition task.
In most speech and natural language systems lies an issue often disregarded in announcements and technical reports: performance on English, the lingua franca of academia and business, far outpaces performance on other languages, especially low-resource ones (Swahili is an oft-cited example). But with data-hungry models, the sheer amount of web-scale data often determines performance.
Language Model Hallucinations
Others have observed that LLMs often do better at two extremes. From Stratechery:
So when you get to this generalizable function, what gives you confidence that the generalizability can extend to domains where it’s almost like on one extreme you have pure creativity where there is no wrong answer, works well there, there’s the other extreme where you’re operating in a domain with a validation function so that you can actually bring AI to bear in a parallel fashion and get the best answer, because you can grade it.
Put another way, the regimes are:
Creative applications (where there is no specific “correct” response), with image or poem generation as two examples, or4
Constrained applications where outputs can be validated (e.g., a snippet of code intended to perform a specific calculation).
These correspond to the selection of problems where either gestalt matters more than detail, or where programmatic checks can be put in place. This split stems from the abundance of data: the entire web is used to pre-train open-ended generation systems, while reliable compilers and interpreters provide ground-truth outputs for synthesized code.
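Here is a minimal sketch of that validation loop for the constrained regime; the candidate snippet and the test cases are hypothetical stand-ins for model output and a grading harness:

```python
# Minimal sketch: grade a model-generated snippet against known
# input/output pairs. The snippet and tests are hypothetical; a real
# system would sandbox the execution.
candidate = """
def add_tax(price, rate):
    return price * (1 + rate)
"""

tests = [((100.0, 0.10), 110.0), ((80.0, 0.25), 100.0)]

def validate(source: str) -> bool:
    namespace = {}
    try:
        exec(source, namespace)  # syntax errors surface here, as in a compile step
        fn = namespace["add_tax"]
        return all(abs(fn(*args) - expected) < 1e-9
                   for args, expected in tests)
    except Exception:
        return False

print(validate(candidate))  # True
```

Where such a grading function exists, candidates can be generated in parallel and the best one kept, exactly the “validation function” regime described above.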
To simplify generation tasks, the application can ask the user for an outline or generate shorter chunks, requiring more frequent human intervention and feedback.
To simplify coding tasks, the system can complete a single line or code block instead of generating an entire program. Even more constrained is the task of generating comments and documentation, or offering quick refactors.
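A minimal sketch of that constraint, where model_complete is a hypothetical stand-in for any completion model: truncate the suggestion at the first newline and accept it only if the combined code still parses:

```python
import ast

def model_complete(prefix: str) -> str:
    """Hypothetical stand-in for a code-completion model."""
    return "    return sum(xs) / len(xs)\n# (lines a full program might add)"

def complete_one_line(prefix: str) -> str | None:
    # Keep only the first suggested line, then check that it parses in
    # context before surfacing it to the user.
    suggestion = model_complete(prefix).split("\n", 1)[0]
    try:
        ast.parse(prefix + suggestion + "\n")
        return suggestion
    except SyntaxError:
        return None

print(complete_one_line("def mean(xs):\n"))  # "    return sum(xs) / len(xs)"
```

The narrower the unit of generation, the cheaper it is to check and the easier it is for the user to review.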
Miles Ahead
In describing the explosion of interest in generative AI following the release of ChatGPT, analogies have been made to the dot-com bubble. Countless ventures from said bubble ended in an extinction event, often due, it seems, to poor timing, as many of the underlying ideas (grocery and pet food delivery, as just two examples) were successfully implemented a dozen years later.
Perhaps the most significant near-term question in AI is this: Will scaling data and compute continue to yield step-function changes in the performance of Transformer models, or will performance plateau, with further levels of intelligence only emerging with breakthrough architectures?
On the highway towards Human-Level AI, Large Language Model is an off-ramp. — Yann LeCun
If gains from simply scaling the existing formula are, in fact, approaching saturation, then with time the last mile can be carefully studied, the frontier of performance better understood as more science than alchemy.
If not, then perhaps, suddenly, a new path will appear—and the journey begins anew.
—
1. Sapling.ai offers a language model toolkit that covers the last mile in a turnkey API/SDK. You can find out more at https://sapling.ai/docs.
2. In case it bears stating: this is all moot if we arrive at superintelligence.
3. Often, there may be little market risk but significant execution risk.
4. Games where chance plays a significant role are often all the more addicting—a pull of the slots, a sudden touchdown in a football game, a capricious (AI) chat companion, etc.