Exploring Built-in AI for Chrome: The Prompt API

Table of contents

  • Introduction

    An early preview of the Chrome Prompt API.

  • Benefits

    The good: the easy, the fast and the free.

  • Costs

    Fast and free. But what cost?

  • A new way of thinking

    Neurons not brain

  • Your brain, not theirs

    A brain is local, and so should our APIs be.

  • Many of one, one of many

    Let’s make Gen AI interactions multi-threaded and nuanced

  • Conclusion

    The future is nearer than ever.

  • Links

    More resources.

June 28, 2024
By Edward.
A robot with a magician’s hat making a hushing motion with his fingers next to a pair of shoelaces.

Introduction

An early preview of the Chrome Prompt API.

I’ve recently been invited into the early preview program for the Chrome Built-in AI (Prompt API). The built-in AI is exploratory work for what will potentially become a cross-browser standard for embedded AI. It leverages Gemini Nano on-device, meaning the model is bundled into your web browser and the LLM generation happens in your local browser environment.

Benefits

The good: the easy, the fast and the free.

There are three primary reasons to want embedded AI in our browsers: speed, cost and usability.

As a native browser API it is easy to use. Accessing the Prompt API is as simple as these two lines of code.


// Create a session with the built-in model (Gemini Nano).
const session = await window.ai.createTextSession();

// Send a prompt and wait for the generated text.
const result = await session.prompt(
  "Tyingshoelaces.com are writing a really cool blog about you. What do you think about that then?"
);

It couldn’t be easier to get Generative AI results where we need them in the browser. I ran a few tests to check the execution time. Although I was disappointed that we are restricted to a single session (no concurrency), the performance for complicated long text generation was good. Remember, there is no network latency either, so the execution time runs from the millisecond we make the request in the browser to the moment the result is available in our code.


VM975:32 Execution Time 1: 0h 0m 3s 47ms
VM975:32 Execution Time 2: 0h 0m 3s 870ms
VM975:32 Execution Time 3: 0h 0m 2s 355ms
VM975:32 Execution Time 4: 0h 0m 3s 176ms
VM975:32 Execution Time 5: 0h 0m 7s 103ms

VM975:44 Average Session Execution Time: 0h 0m 3s 910.1999999999998ms

The average execution time for five chained requests to the built-in AI is between 3 and 4 seconds per complete request for long text generation prompts. I ran this several times (the script is included in the GitHub repo), and although this varies by device, I’d also expect it to improve as the API gets optimized. I’ve noticed that shorter JSON generation tasks are much quicker (200-400ms).
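
For reference, the timing loop boils down to something like the sketch below. It assumes the early-preview window.ai surface shown above; the prompts are placeholders and the real script in the repo differs in detail.

// Rough sketch of the timing loop; the real script in the repo differs in detail.
// Assumes the early-preview window.ai surface shown above.
const prompts = [
  "Write a long blog introduction about shoelaces.",
  "Write a long blog introduction about web browsers.",
  "Write a long blog introduction about embedded AI.",
];

const session = await window.ai.createTextSession();
const timings = [];

for (const [i, prompt] of prompts.entries()) {
  const start = performance.now();
  await session.prompt(prompt); // wait for the full response
  const elapsed = performance.now() - start;
  timings.push(elapsed);
  console.log(`Execution Time ${i + 1}: ${(elapsed / 1000).toFixed(3)}s`);
}

const average = timings.reduce((sum, t) => sum + t, 0) / timings.length;
console.log(`Average Session Execution Time: ${(average / 1000).toFixed(3)}s`);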

This is more than acceptable for most use cases. We’ve also effectively crowdsourced the problem of scale for our LLMs: where industrial-scale API usage is infamously expensive, here every LLM request is handled locally by an experimental browser API. It feels really nice and opens up a world of possibilities.

By having Chrome users embed the model into their browser, we have a distribution mechanism with preloaded generative AI models at the point of use and without the need for large servers. This is similar to WebLLM, but with the significant advantage that the models are preloaded and bundled into our browsers. This means we can download a single model for use across ‘the internet’ rather than being forced to download a vendor-specific model.

The huge positives for this experimental browser API are strong arguments for adoption: it’s fast, it’s free (or rather, paid for by the consumer) and really easy to use.

But what are the tradeoffs?

Costs

Fast and free. But what cost?

The API is unapologetically intended only for experimentation, not for production usage. As a result, a lot of the output is less refined than we would expect from more mature, hosted models. The limitations on size, alongside the generalist nature of the model, mean that we don’t get polished output.

This leads to frustrations that take us back to the early days of Generative AI APIs. I found myself using a lot of prompt engineering and validation logic to get reliable JSON responses. Every few requests the API seems to become non-responsive, and it’s quite easy to confuse the model, in which case it bombs out.
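
To give a flavour of what I mean by validation logic, here is a minimal sketch of a retry wrapper; the prompt wording and retry count are illustrative rather than the exact code from my demo.

// Minimal sketch: ask for JSON, validate, and retry a few times if the model drifts.
// The prompt wording and retry count are illustrative, not the exact demo code.
async function promptForJson(session, instruction, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await session.prompt(
      `${instruction}\nRespond with valid JSON only, no commentary.`
    );
    try {
      return JSON.parse(raw.trim());
    } catch {
      console.warn(`Attempt ${attempt}: response was not valid JSON, retrying...`);
    }
  }
  throw new Error("Could not get valid JSON from the model");
}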

There is also the argument that, because this model is embedded in the browser, it has some value as a ‘private’ model. I’m not sure this is relevant to most use cases, as public-facing websites will still be interacting with their servers, and for the average user it is hard to be certain that data never leaves the local environment. Having said that, for internal, non-public-facing systems that operate via a browser (corporate environments, for example) this could be a bonus point.

The lack of sophistication in the responses owing to the smaller model means that we have to be very careful about the tasks that we use this for. Architectures of the future will optimize their generative AI implementations to use the right weight (and therefore cost) for the right task. I envisage multiple small, highly tuned and task oriented LLMs, each being used for a specific output.

Having said that, all is forgivable especially as the API is explicitly designed for experimentation, not production usage.

The good
- Cost
- Scale
- Speed
- Usability
- Privacy

The bad
- Sacrifice in quality
- Implementation cost

As an example, if we wanted a deep analysis of current affairs, we would need a large context window and a sophisticated RAG flow to inform the output; embedded AI is almost certainly not the right approach. Google alludes to this in their resources.

But I have a theory that I wanted to put to the test: a harebrained, mad and tremendously fun theory, and a micro, browser-hosted LLM was the perfect place to do so.

A new way of thinking

Neurons not brain

There has been a little itch I’d been wanting to scratch for a while. What if we are using LLMs all wrong? In fact, what if we’ve got the conceptual model wrong?

As we race for ever larger context windows with expanding training data, we are trying to scale Generative AI vertically. Bigger, stronger, faster, better. My jaw drops as I see people kindly asking for context windows large enough to plug in the entire internet, and then asking the algorithm in the middle to please pick out exactly the information and output that we want from this enormous lake. And faster.

We treat every interaction with an LLM as an API call: text goes in, magic happens, text comes out. This magic in the middle we call intelligence. The more text in, the louder the magic, the better the result. This is our current path forward.

I can’t help wondering if we are focused on the wrong scale or zoom, an erroneous interpretation of cognition. 

The thing about thinking in general, especially creative output (which is exactly what text generation is), is that it isn’t such a simple process. It’s not a single thread. We are already seeing this in the newer models; for example, in my breakdown of the Claude 3.5 Sonnet system prompt we see that many of the recent advances in LLM output are probably not down to the algorithm itself, but to the infrastructure, systems and tuning that contextually guide the output.

I’ve been wanting to try out a concept of tiny, fast connections meshed together to build something bigger. In the end, a context window of 100k is the same as a 1k window, 100 times over. I suspect that even as we focus on the grandiose, the key is in small and precise details meshed together to form something larger. This fits my mental paradigm of intelligence much more than a sentient machine ‘brain’.

This hasn’t been possible until now due to the relative inefficiency of models in general and the prohibitive cost. Imagine telling Bob in accounts that we are going to 100x the number of requests to ChatGPT because we theorize that microtransactions in a mesh architecture will improve the quality of our AI systems. I don’t think Bob works at OpenAI, but for the rest of us it just isn’t feasible.

Even a small and efficient embedded model in the browser isn’t really ready to handle my theorizing. It’s not quite fast enough and doesn’t enable concurrent requests (concurrent thoughts!); but it is a step in the right direction and we’ve come far from cloud hosted APIs charging massive fees for each request. I can’t see the functional architecture, but I can see the path towards it. 

To test out this theory, I dusted off my programming gloves, opened up a browser, and started my epic journey towards a mesh architecture with 1,000 multithreaded requests.

The results were magical. 

Your brain, not theirs

A brain is local, and so should our APIs be.

I love voice. I think keyboards and mice have become extensions of our monkey brains, but they are human contraptions and are therefore inherently limited as interfaces. As technology advances, so will interfaces, and at some point keyboards, mice and even screens will be as obsolete to our descendants as oil lamps and carrier pigeons are to us.

So, whatever I wanted to build had to be voice controlled. Luckily, there’s a browser API for that.

1. Speech Recognition API (speech to text)
2. Text to Speech API
3. Prompt API
4. The internet (accessed via a browser)

What I wanted to build was a voice-controlled browser interaction demo. An intelligent website that navigates, responds and changes based on the browser context and the input, using nothing other than my voice. No keyboard. No mouse. “Me, my voice, a browser and the prompt API”. Sounds like the worst children’s story I’ve ever heard. I’ve probably written worse.
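
The wiring is surprisingly small. Below is a stripped-down sketch: the Web Speech API captures the voice input and the Prompt API generates the response. The prompt text is a placeholder, and the real demo does far more prompt engineering and page manipulation.

// Stripped-down sketch: capture a voice command and hand it to the Prompt API.
// The prompt text is a placeholder; error handling and page actions are omitted.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = "en-GB";

recognition.onresult = async (event) => {
  const transcript = event.results[0][0].transcript;
  const session = await window.ai.createTextSession();
  const reply = await session.prompt(
    `You are controlling this website. The user said: "${transcript}". Respond helpfully.`
  );
  console.log(reply); // in the demo this drives the page rather than logging
};

recognition.start(); // in practice, started from a user gesture so the mic permission prompt appears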

Conceptually, this is very similar to the Rabbit device or the Humane AI pin. These are both ambitious ventures, but the problem they share is that they are trying to build an ‘AI OS’: a new, AI-powered interface into software. I find the goal too grandiose, essentially trying to build a new interface into the internet with a sprinkling of AI.

Innovation is about iteration, and the internet in 2024 is ubiquitous and fundamentally intertwined with the browser. Trying to invent a human-friendly AI OS interface is a similar endeavor to trying to reinvent the internet. Folks are already asking, ‘what can I do with this that I can’t already do with my mobile phone, but better?’...

Innovation requires a blending of the new and untested, but with solid and proven foundations. Too much instability, and the results will be mad scientist territory, but get the balance of the proven and the experimental just right and sometimes, just sometimes, something special happens. 

Screenshot of the browser AI prompt API in action

The cognitive paradigm that we have gotten wrong in most LLM use cases is that we treat an engagement as a handshake: Input ← LLM → Output. Input in, output out. However, real human interactions are multidimensional processes that can be broken down into different thoughts and actions.


Store Attendant greets customer ->

[Thoughts]

What are they wearing, how does their style influence their buying patterns

What is their demographic, how does their age influence their buying patterns

How will gender influence their buying patterns

What kind of mood/social signals are they giving off

What have they actually said that will influence their choices

[Action]

Good morning sir, how are you


Customer greets attendant ->

[Thoughts]

Hurry up, I’m busy

Hope they have what I want (by reading my mind!)

Will they accept returns?

[Action]

Good morning, I’m looking for a pair of shoes.

We’ve gone so deep into computer science that our thought processes around the discipline have become binary. We think of inputs and outputs, true and false. The truth is that human interactions and thoughts are complicated and nuanced; we can’t reduce or simplify them to binary. But what we can do is mesh this wonderful technology in new and creative ways, to break down the barriers that are homogenizing the output and turning the internet into slurry.

Many of one, one of many

Let’s make Gen AI interactions multi-threaded and nuanced

My proposal for experimentation uses the built-in AI to mirror social and human interactions. Let’s use an example that I have muscle memory of: building a recommendation algorithm for ecommerce, with a rough sketch of how the threads could run after the list.


Thread 1: Social Cues, sentiment analysis
– How long has it taken for the user to interact?
– Is their browsing behavior aggressive, slow, calm, controlled?
– Have they arrived from a particular source, or are they looking for something specific?

Thread 2: Behavior Cues, interpretation of user input
– How have they begun the conversation? A greeting?
– What tone are they using?

Thread 3: User context, data we have about similar demographics and their preferences
– What age group do they belong to? How does this influence preferences?
– How do they identify? How does this influence preferences?

Thread 4: Site context, data we have about how other users are using the site, and trends
– What are the trending products?
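
Because the preview API only gives us a single session, a sketch of this mesh ends up sequential rather than genuinely multi-threaded; the thread prompts below are trimmed, placeholder versions of the questions above rather than the prototype’s actual prompts.

// Sketch: run each "thread" as its own small prompt and collect the answers.
// Sequential because the preview API only allowed a single session; placeholders
// like {signals} would be filled with real data in the prototype.
const threads = {
  social: "Given these browsing signals: {signals}, describe the user's likely mood in one sentence.",
  behaviour: "Given this opening message: {message}, describe the user's tone in one sentence.",
  userContext: "Given this demographic data: {profile}, list three likely product preferences.",
  siteContext: "Given these trending products: {trending}, pick the two most relevant.",
};

const session = await window.ai.createTextSession();
const cues = {};

for (const [name, template] of Object.entries(threads)) {
  cues[name] = await session.prompt(template);
}

// The collected cues then feed a final recommendation prompt.
const recommendation = await session.prompt(
  `Using these observations: ${JSON.stringify(cues)}, recommend a product and explain why in one sentence.`
);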

There is no silver bullet for interpreting so many data points, and there never will be. LLMs are not a plug-in “sentiment analyzer, entity classifier, jack of all trades”. LLMs are generative algorithms that can creatively and logically interpret inputs. Notice that each of the cues in the threads is not an output; it is a question. To inform thought and generative AI, we need to ask far more questions than we provide answers. We need to be sophisticated about how we get all our data points, and structured in the way we feed them into our LLMs. So, to use behavior and social cues as an example, we’d need to do the following (a rough sketch of this preparation follows the list):

1. Sentiment analysis
2. Data analysis for browser behavior vs site and global averages
3. Extract referral data from requests
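
As a rough sketch of that preparation step, something like the snippet below could gather the raw signals; the field names are my own, and the sentiment value is a stand-in for whatever analysis would run before anything reaches the LLM.

// Rough sketch of cue preparation; runs entirely before any prompt is sent.
// The field names are illustrative and the sentiment field is a placeholder
// for a real sentiment analysis step.
function collectBehaviourCues() {
  const [nav] = performance.getEntriesByType("navigation");
  return {
    referrer: document.referrer || "direct",
    pageLoadMs: nav ? Math.round(nav.duration) : null,
    timeOnPageMs: Math.round(performance.now()),
    sentiment: "unknown",
  };
}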

All of this data would be prepared and processed long before it goes to our LLM. But, once prepared, we can inform the model with a prompt like:


User A is a return visitor showing signs of being slightly upset. Remember this as you deal with them, make sure to reassure them we have a returns system. [Action]: Link to our returns policy and popular products.

An alternative would be:


User B is showing signs of being impatient and has arrived looking directly for Product X. Take them to the product page and offer to add it to the cart. [Action]: Navigate directly to page X and add the product to the cart.

LLMs in this sense are our agents and interpreters, but the mistake people are making is assuming the “algorithm” is the solution for quality output. Just as with real agents, their judgment is only as reliable as the data and the cues we give them. Ask more questions than you provide answers. This is an inalienable social truth, and it is why our current expectations of LLMs are so off-kilter and why agents are leading many into the trough of disillusionment. Rubbish in, rubbish out; it doesn’t matter how good the algorithm is.

Just to get the cues for our recommendation algorithm, we’d need to rely on an array of specialist tools and AI infrastructure that is beyond the capabilities of all but a few platforms on the planet. But we can get there iteratively by building nuance, threads and sophistication into the infrastructure feeding our LLMs. 

And now that they are in the browser, the future has never been so near.

Screenshot of the browser AI prompt API in action part two

I built a simple prototype with mock social cues and inputs. I sprinkled a bit of user data on top and asked the Prompt API to respond to my voice with a combination of thoughts and actions. It’s nothing more than a vision of something that ‘might’ work. But by providing granular, detailed and controlled inputs to the Prompt API, we get intelligent and thoughtful responses. It’s a vision of a mesh infrastructure in which multiple processes dynamically learn, reinforce and inform each other.
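
Roughly, the prototype’s final step looks like the sketch below, reusing the promptForJson helper, session, cues and transcript from the earlier sketches. The action names and JSON shape are my own illustration, not any kind of spec.

// Roughly the shape of the prototype's final prompt and response handling.
// Reuses promptForJson, session, cues and transcript from the earlier sketches;
// the action names and JSON shape are illustrative, not a spec.
const reply = await promptForJson(
  session,
  `Cues: ${JSON.stringify(cues)}. The user said: "${transcript}".
Return JSON like {"thoughts": "...", "action": {"type": "navigate", "target": "/returns"}}.`
);

console.log(reply.thoughts); // the model's reasoning, surfaced for debugging

switch (reply.action?.type) {
  case "navigate":
    window.location.href = reply.action.target;
    break;
  case "speak":
    speechSynthesis.speak(new SpeechSynthesisUtterance(reply.action.text));
    break;
  default:
    console.warn("Unrecognised action", reply.action);
}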

It won’t work yet. But it might work someday, and the prompt engineering with voice input feels magical. It’s a destination worth driving towards.

Conclusion

The future is nearer than ever.

We are still in the early stages of LLMs, and I predict that advances will be slower than expected and that AGI (by any reasonable definition) won’t arrive for generations. But with each step on the road, a world of opportunities arises. Building highly efficient, well thought out and well defined infrastructure massively improves the quality of output from our LLMs, irrespective of model size or algorithm quality.

Moving LLMs to the browser can also be understood as moving LLMs to the internet. It will be cheap and accessible to the masses, making it easy to experiment with. Forcing folk to think smaller, to build more efficiently and to add depth and nuance to their solutions is a good thing, so I’m not too worried about ‘micro’ models. The sophistication is in the usage, not just the tool itself, so this is a giant leap forward.

I attach my demo. It is throwaway code for a proof of concept, built upon an exploratory API that is only suitable for demo purposes. And it only works sometimes.

Yet it is a wonderful vision of the future.

Links

More resources.

Follow me:

GitHub repo