Understanding the Windows Copilot Runtime

By Simon Bisson

It wasn’t hard to spot the driving theme of Build 2024. From the pre-event launch of Copilot+ PCs to the two big keynotes from Satya Nadella and Scott Guthrie, it was all AI. Even Azure CTO Mark Russinovich’s annual tour of Azure hardware innovations focused on support for AI.

For the first few years after Nadella became CEO, he spoke many times about what he called “the intelligent cloud and the intelligent edge,” mixing the power of big data, machine learning, and edge-based processing. It was an industrial view of the cloud-native world, but it set the tone for Microsoft’s approach to AI, using the supercomputing capabilities of Azure to host training and inference for its AI models in the cloud, no matter how big or how small those models are.

Moving AI to the edge

With the power and cooling demands of centralized AI, it’s not surprising that Microsoft’s key announcements at Build were focused on moving much of its endpoint AI functionality from Azure to users’ own PCs, taking advantage of local AI accelerators to run inference on a selection of different algorithms. Instead of running Copilots on Azure, it would use the neural processing units, or NPUs, that are part of the next generation of desktop silicon from Arm, Intel, and AMD.

Hardware acceleration is a proven approach that has worked again and again. Back in the early 1990s I was writing finite element analysis code that used vector processing hardware to accelerate matrix operations. Today’s NPUs are the direct descendants of those vector processors, optimized for similar operations in the complex vector space used by neural networks. If you’re using any of Microsoft’s current generation of Arm devices (or a handful of recent Intel or AMD devices), you’ve already got an NPU, though not one capable of the 40 TOPS (tera operations per second) needed to meet Microsoft’s Copilot+ PC requirements.

Microsoft has already demonstrated a range of different NPU-based applications on this existing hardware, with access for developers via its DirectML APIs and support for the ONNX inference runtime. However, Build 2024 showed a different level of commitment to its developer audience, with a new set of endpoint-hosted AI services bundled under a new brand: the Windows Copilot Runtime.

The Windows Copilot Runtime is a mix of new and existing services that are intended to help deliver AI applications on Windows. Under the hood is a new set of developer libraries and more than 40 machine learning models, including Phi Silica, an NPU-focused version of Microsoft’s Phi family of small language models.

The models of the Windows Copilot Runtime are not all language models. Many are designed to work with the Windows video pipeline, supporting enhanced versions of the existing Studio effects. If the bundled models are not enough, or don’t meet your specific use cases, there are tools to help you run your own models on Windows, with direct support for PyTorch and a new web-hosted model runtime, WebNN, which allows models to run in a web browser (and possibly, in a future release, in WebAssembly applications).

An AI development stack for Windows

Microsoft describes the Windows Copilot Runtime as “new ways of interacting with the operating system” using AI tools. At Build the Windows Copilot Runtime was shown as a stack running on top of new silicon capabilities, with new libraries and models, along with the necessary tools to help you build that code.

That simple stack is something of an oversimplification. Then again, showing every component of the Windows Copilot Runtime would quickly fill a PowerPoint slide. At its heart are two interesting features: the DiskANN local vector store and the set of APIs that are collectively referred to as the Windows Copilot Library.

You might think of DiskANN as the vector database equivalent of SQLite. It’s a fast local store for the vector data that are key to building retrieval-augmented generation (RAG) applications. Like SQLite, DiskANN has no UI; everything is done through either a command line interface or API calls. DiskANN uses a built-in nearest neighbor search and can be used to store embeddings and content. It also works with Windows’ built-in search, linking to NTFS structures and files.

Building code on top of the Windows Copilot Runtime draws on the more than 40 different AI and machine learning models bundled with the stack. Again, these aren’t all generative models, as many build on models used by Azure Cognitive Services for computer vision tasks such as text recognition and the camera pipeline of Windows Studio Effects.

There’s even the option of switching to cloud APIs, for example offering the choice of a local small language model or a cloud-hosted large language model like ChatGPT. Code might automatically switch between the two based on available bandwidth or the complexity of the current task.
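To make that concrete, here is a minimal sketch of such a routing layer in C#. The ILanguageBackend interface and the length-based “complexity” check are illustrative choices of mine, not part of any Microsoft SDK; the point is simply that the decision between local and cloud inference can be isolated behind one small abstraction and refined later with better heuristics such as measured bandwidth or a token budget.

    // A minimal sketch of the local-versus-cloud routing described above.
    // ILanguageBackend and the length-based "complexity" check are illustrative,
    // not part of any Microsoft SDK.
    using System.Net.NetworkInformation;
    using System.Threading.Tasks;

    public interface ILanguageBackend
    {
        Task<string> CompleteAsync(string prompt);
    }

    public sealed class PromptRouter
    {
        private readonly ILanguageBackend _local;   // e.g. an on-device small language model
        private readonly ILanguageBackend _cloud;   // e.g. a cloud-hosted large language model

        public PromptRouter(ILanguageBackend local, ILanguageBackend cloud)
        {
            _local = local;
            _cloud = cloud;
        }

        public Task<string> CompleteAsync(string prompt)
        {
            // Send long or complex prompts to the cloud model when the network is
            // available; otherwise fall back to the local model.
            bool online = NetworkInterface.GetIsNetworkAvailable();
            bool complex = prompt.Length > 500;
            return (online && complex)
                ? _cloud.CompleteAsync(prompt)
                : _local.CompleteAsync(prompt);
        }
    }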

Microsoft provides a basic checklist to help you decide between local and cloud AI APIs. Key points to consider are available resources, privacy, and costs. Using local resources won’t cost anything, while the costs of using cloud AI services can be unpredictable.

Windows Copilot Library APIs like AI Text Recognition require an appropriate NPU to take advantage of hardware acceleration. Images need to be added to an image buffer before calling the API. As with the equivalent Azure API, you deliver a bitmap to the API and collect the recognized text as a string. You can additionally get bounding box details, so you can provide an overlay on the initial image, along with confidence levels for the recognized text.
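As an illustration, the following sketch follows that flow: bitmap in, recognized lines with bounding boxes and confidence out. The TextRecognizer and ImageBuffer names and the namespaces here are assumptions based on the behavior described above, not confirmed Windows App SDK types, so expect the shipping API to differ in detail.

    // An illustrative sketch of the local text-recognition flow. The namespaces,
    // TextRecognizer, ImageBuffer, and the line properties are assumed names.
    using System.Text;
    using System.Threading.Tasks;
    using Windows.Graphics.Imaging;        // SoftwareBitmap
    using Microsoft.Windows.Vision;        // assumed namespace for the local OCR API

    public static class OcrSample
    {
        public static async Task<string> RecognizeAsync(SoftwareBitmap bitmap)
        {
            // Wrap the bitmap in the image buffer the API expects.
            var buffer = ImageBuffer.CreateCopyFromBitmap(bitmap);

            // Create the recognizer; the model runs locally on the NPU.
            var recognizer = await TextRecognizer.CreateAsync();
            var result = recognizer.RecognizeTextFromImage(buffer);

            var text = new StringBuilder();
            foreach (var line in result.Lines)
            {
                // Each line carries its text, a bounding box for overlays,
                // and a confidence score.
                text.AppendLine($"{line.Text} (confidence {line.Confidence:F2})");
            }
            return text.ToString();
        }
    }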

Phi Silica: An on-device language model for NPUs

One of the key components of the Windows Copilot Runtime is the new NPU-optimized Phi Silica small language model. Part of the Phi family of models, Phi Silica is a simple-to-use generative AI model designed to deliver text responses to prompt inputs. Sample code shows that Phi Silica uses a new Microsoft.Windows.AI.Generative C# namespace and it’s called asynchronously, responding to string prompts with a generative string response.

Using the basic Phi Silica API is straightforward. Once you’ve created a method to handle calls, you can either wait for a complete string or get results as they are generated, allowing you to choose the user experience. Other calls get status information from the model, so you can see if prompts have created a response or if the call has failed.
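Based on the sample code shown at Build, a call into Phi Silica looks roughly like the sketch below. The LanguageModel class sits in the Microsoft.Windows.AI.Generative namespace mentioned above, but the exact method names and status values may change before the Windows App SDK ships, so treat this as the shape of the API rather than its final form.

    // A sketch following the Phi Silica sample code shown at Build; names and
    // status values may change in the shipping Windows App SDK.
    using System.Threading.Tasks;
    using Microsoft.Windows.AI.Generative;

    public static class PhiSilicaSample
    {
        public static async Task<string> AskAsync(string prompt)
        {
            // Load the NPU-optimized on-device model.
            using LanguageModel model = await LanguageModel.CreateAsync();

            // Wait for the full response. (A progress-reporting variant was also
            // shown, for streaming partial results to the UI as they arrive.)
            var result = await model.GenerateResponseAsync(prompt);

            // Check the status before trusting the text.
            return result.Status == LanguageModelResponseStatus.Complete
                ? result.Response
                : string.Empty;
        }
    }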

Phi Silica does have limitations. Even using the NPU of a Copilot+ PC, Phi Silica can process only 650 tokens per second. That should be enough to deliver a smooth response to a single prompt, but managing multiple prompts simultaneously could show signs of a slowdown.

Phi Silica was trained on textbook content, so it’s not as flexible as, say, ChatGPT. However, it is less prone to errors, and it can be built into your own local agent orchestration using RAG techniques and a local vector index stored in DiskANN, targeting the files in a specific folder.
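A local agent built this way follows the familiar RAG pattern: embed the question, pull the nearest chunks from the vector index, and ground the prompt before handing it to Phi Silica. The sketch below shows only that shape; IVectorIndex and the embedding delegate are hypothetical placeholders of mine, since Microsoft has not yet published a developer API for DiskANN on Windows.

    // A conceptual sketch of the local RAG loop described above. IVectorIndex and
    // the embedding delegate are hypothetical placeholders, not a published API.
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Windows.AI.Generative;   // Phi Silica, as in the earlier sketch

    public interface IVectorIndex
    {
        // Returns the stored text chunks nearest to the query embedding.
        Task<IReadOnlyList<string>> QueryAsync(float[] queryEmbedding, int topK);
    }

    public sealed class LocalRagAgent
    {
        private readonly IVectorIndex _index;                  // e.g. a DiskANN-backed store over one folder
        private readonly Func<string, Task<float[]>> _embed;   // an embedding model of your choice

        public LocalRagAgent(IVectorIndex index, Func<string, Task<float[]>> embed)
        {
            _index = index;
            _embed = embed;
        }

        public async Task<string> AnswerAsync(string question)
        {
            // 1. Embed the question and retrieve the closest local passages.
            float[] query = await _embed(question);
            IReadOnlyList<string> passages = await _index.QueryAsync(query, topK: 3);

            // 2. Ground the prompt in the retrieved content.
            string prompt = "Answer using only this context:\n" +
                            string.Join("\n---\n", passages) +
                            "\n\nQuestion: " + question;

            // 3. Generate the answer with the on-device model.
            using LanguageModel model = await LanguageModel.CreateAsync();
            var result = await model.GenerateResponseAsync(prompt);
            return result.Response;
        }
    }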

Microsoft has talked about the Windows Copilot Runtime as a separate component of the Windows developer stack. In fact, it’s much more deeply integrated than the Build keynotes suggest, shipping as part of a June 2024 update to the Windows App SDK. Microsoft is not simply making a big bet on AI in Windows, it’s betting that AI and, more specifically, natural language and semantic computing are the future of Windows.

Tools for building Windows AI

While it’s likely that the Windows Copilot Runtime stack will build on the existing Windows AI Studio tools, now renamed the AI Toolkit for Visual Studio Code, the full picture is still missing. Interestingly, recent builds of the AI Toolkit (post Build 2024) added support for Linux x64 and Arm64 model tuning and development. That bodes well for a rapid rollout of a complete set of AI development tools, and for a possible future AI Toolkit for Visual Studio.

An important feature of the AI Toolkit that’s essential for working with Windows Copilot Runtime models is its playground, where you can experiment with your models before building them into your own Copilots. It’s intended to work with small language models like Phi, or with open-source PyTorch models from Hugging Face, so should benefit from new OS features in the 24H2 Windows release and from the NPU hardware in Copilot+ PCs.

We’ll learn more details with the June release of the Windows App SDK and the arrival of the first Copilot+ PC hardware. However, already it’s clear that Microsoft aims to deliver a platform that bakes AI into the heart of Windows and, as a result, makes it easy to add AI features to your own desktop applications—securely and privately, under your users’ control. As a bonus for Microsoft, it should also help keep Azure’s power and cooling budget under control.

© InfoWorld