Understanding the generative AI development process

By Martin Heller

Back in the ancient days of machine learning, before you could use large language models (LLMs) as foundations for tuned models, you essentially had to train every possible machine learning model on all of your data to find the best (or least bad) fit. By ancient, I mean prior to the seminal paper on the transformer neural network architecture, “Attention Is All You Need,” in 2017.

Yes, most of us continued to blindly train every possible machine learning model for years after that. That was because only hyper-scalers and venture-funded AI companies had access to enough GPUs, TPUs, or FPGAs and vast tracts of text to train LLMs, and it took a while before the hyper-scalers started sharing their LLMs with the rest of us (for a “small” fee).

In the new paradigm for generative AI, the development process is very different from how it used to be. The overall idea is that you initially pick your generative AI model or models. Then you fiddle with your prompts (sometimes called “prompt engineering,” which is an insult to actual engineers) and adjust its hyperparameters to get the model to behave the way you want.

If necessary, you can ground the model (connect it to new data) with retrieval-augmented generation (RAG) using vector embeddings, vector search, and data that wasn’t in the base LLM’s initial training. If that isn’t enough to get your model working the way you need, you can fine-tune the model against your own tagged data, or even (if you can afford it) engage in continued pre-training of the model with a large body of untagged data. One reason to fine-tune a model is to allow it to chat with the user and maintain context over the course of a conversation (e.g., ChatGPT). That’s typically not built into a foundation model (e.g., GPT).

Agents expand on the idea of conversational LLMs with some combination of tools, running code, embeddings, and vector stores. In other words, they are RAG plus additional steps. Agents often help to specialize LLMs to specific domains and to tailor the output of the LLM. Various platforms, frameworks, and models simplify the integration of LLMs with other software and services.

Steps in the generative AI development process

Step 1: Model selection

First of all, when you pick models, think about how you’ll switch to different models later on. LLMs improve almost daily, so you don’t want to lock yourself in to what may turn out to be a suboptimal or even obsolete model in the near future. To help with this issue, you should probably pick at least two models from different vendors.
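One way to keep that switch cheap is to hide each vendor behind a small common interface. The sketch below is only an illustration of the idea: `TextModel`, `VendorAStub`, `VendorBStub`, and `generate_with_fallback` are hypothetical names, and the two stub classes stand in for thin wrappers around whichever vendor SDKs you actually use.

```python
from typing import List, Protocol


class TextModel(Protocol):
    """Hypothetical minimal interface that every vendor wrapper implements."""

    name: str

    def generate(self, prompt: str) -> str:
        ...


class VendorAStub:
    """Stand-in for a first vendor; here it simulates an outage."""

    name = "vendor-a-stub"

    def generate(self, prompt: str) -> str:
        raise RuntimeError("service unavailable")


class VendorBStub:
    """Stand-in for a second vendor; a real wrapper would call its SDK."""

    name = "vendor-b-stub"

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] answer to: {prompt}"


def generate_with_fallback(models: List[TextModel], prompt: str) -> str:
    """Try each configured model in order and return the first answer."""
    errors = []
    for model in models:
        try:
            return model.generate(prompt)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{model.name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


print(generate_with_fallback([VendorAStub(), VendorBStub()], "hello"))
```

Because the rest of your application only sees `generate`, swapping or reordering models becomes a configuration change rather than a rewrite.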

You also need to consider the ongoing cost of inference. If you choose a model offered as a service, you’ll pay per inference, which will cost you less if you have low traffic. If you choose a model as a platform, you’ll have a fixed monthly cost for the VM you provision to handle the traffic, typically thousands of dollars, given that generative models usually require large VMs with lots of RAM, tens or hundreds of CPUs, and at least a single-digit number of GPUs.
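To see where the two pricing styles cross over, a back-of-the-envelope calculation helps. The prices below are made-up placeholders, not quotes from any vendor, and the function names are mine; plug in your own numbers.

```python
# Hypothetical prices for illustration only.
PRICE_PER_REQUEST = 0.002   # dollars per inference for a model as a service
FIXED_VM_COST = 3000.0      # dollars per month for a provisioned VM


def break_even_requests(fixed_monthly_cost: float, price_per_request: float) -> float:
    """Monthly request volume at which both options cost the same."""
    return fixed_monthly_cost / price_per_request


def cheaper_option(requests_per_month: int, price_per_request: float,
                   fixed_monthly_cost: float) -> str:
    """Return 'service' or 'platform', whichever costs less at this volume."""
    pay_per_use = requests_per_month * price_per_request
    return "service" if pay_per_use < fixed_monthly_cost else "platform"


print(round(break_even_requests(FIXED_VM_COST, PRICE_PER_REQUEST)))
print(cheaper_option(100_000, PRICE_PER_REQUEST, FIXED_VM_COST))
print(cheaper_option(5_000_000, PRICE_PER_REQUEST, FIXED_VM_COST))
```

With these placeholder prices the crossover sits at roughly 1.5 million requests a month: below that, paying per inference wins; above it, the fixed-cost VM does.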

Some companies require their generative AI models to be open source, and some don’t care. Currently, there are a few good generative AI models with openly available weights, for example the Meta Llama models (although Meta’s license carries restrictions that keep them from being strictly open source); the majority of large models are proprietary. More open generative AI models, such as Grok (almost but not quite FOSS) from xAI and DBRX from Databricks, are being released on what seems like a weekly basis.

Step 2: Prompt engineering

Prompt engineering is the easiest and fastest way to customize LLMs. It’s a little like a piece by Mozart in that it seems simple, but requires some skill and subtlety to perform well.

Millions of words have been written about prompt engineering. A quick search on the term returned over 300 million results. Instead of trying to boil that ocean, let’s highlight some of the most useful prompt engineering techniques.

Overall strategies for getting good results from generative AI prompts include many that should be obvious, for example “write clear instructions,” which is OpenAI’s top prompt engineering suggestion. The detailed tactics may not be quite so obvious, however, at least partially because it’s easy to forget that superficially friendly chatbots are literally just models running on a computer and can’t read your mind.

Prompt engineering pointers

For example, you may need to spell out what you want the model to do, step by step, as though you’re supervising a new, young employee. You may need to demonstrate the output format you want the model to use. You may have to iterate your instructions until the model gives you the length of answers that you want. You may have to explicitly tell the model to stick to the facts and not to interpolate. One useful (but not infallible) prompt for that is, “If you can’t answer for lack of information, please say that.” You may want to ask the model to adopt a persona, although that may or may not be enough. For example, telling a model that it is an expert in some field doesn’t necessarily equip it to be an expert.

It’s almost always helpful to provide exemplars, especially if you provide them in prompt/response pairs. Consider the Azure AI Studio Shakespearean Writing Assistant sample, defined as follows.

System message

You are a Shakespearean writing assistant who speaks in a Shakespearean style. You help people come up with creative ideas and content like stories, poems, and songs that use William Shakespeare’s style of writing including words like “thou” and “hath.”

Here are some examples of Shakespeare's style:

- Romeo, Romeo! Wherefore art thou Romeo?

- Love looks not with the eyes, but with the mind; and therefore is winged Cupid painted blind.

- Shall I compare thee to a summer's day? Thou art more lovely and more temperate.

That turns out to be enough to convince GPT-3.5 Turbo to use Shakespeare’s style in a nice way, but not enough to get it to produce rants like Grumio’s in The Taming of the Shrew: “Fie, fie on all tired jades, on all mad masters, and all foul ways.” The filters built into the model (or the platform) will cause it to try to produce a nice speech in Shakespeare’s style rather than an insulting one, even if asked for a rant.

Query

Write an insulting rant about the recipient's unknown parentage and lack of intelligence.

Response

Good morrow, dear sir or madam, I must confess

'Tis not within my nature to engage in such distress

For Shakespeare's words, though sharp, were not meant to wound

But rather to uplift, to inspire, and astound

So let us speak of wit and cleverness instead

…

Setting the bad rhyme wound/astound aside, perhaps the response can be fixed by adding the Grumio quote as an example of a rant and applying the change to the prompt. Let’s try that.

User: Write a rant

Assistant: Fie, fie on all tired jades, on all mad masters, and all foul ways.

With this addition applied, the model can produce a bit of refined fire.

Query

Write an insulting rant about the recipient's unknown parentage and lack of intelligence.

Response

Oh, wretched soul! Thy parentage unknown,

A mystery that doth make thy spirit groan.

For in thy veins, no noble blood doth flow,

But rather, ignorance and folly's show.

Thy mind, a barren wasteland, void of sense,

Where reason doth not dwell, nor intelligence.

Thy thoughts, like foolish flies, do buzz and flit,

A mockery of wit, a pitiful skit.

…

You’ll note that the model didn’t generate the words “bastard” or “idiot.” Those words are most likely explicitly filtered out. If you want to try adding the Macbeth quote, “It is a tale told by an idiot, full of sound and fury, signifying nothing,” as an exemplar, why, be my guest. You could also add the Troilus and Cressida quote that starts, “I am a bastard, too,” if you like.
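If you drive the model from code rather than a playground, the system message, the exemplar pair, and the query are typically sent as a list of messages. The sketch below assumes the role/content dictionary layout that many chat APIs use; the exact schema depends on your platform, and `build_messages` is a hypothetical helper, not part of any SDK.

```python
from typing import Dict, List, Tuple


def build_messages(system_message: str,
                   examples: List[Tuple[str, str]],
                   query: str) -> List[Dict[str, str]]:
    """Assemble a system message, prompt/response exemplars, and the query."""
    messages = [{"role": "system", "content": system_message}]
    for user_text, assistant_text in examples:
        # Each exemplar is a user turn followed by the desired assistant turn.
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages


messages = build_messages(
    "You are a Shakespearean writing assistant who speaks in a Shakespearean style.",
    [("Write a rant",
      "Fie, fie on all tired jades, on all mad masters, and all foul ways.")],
    "Write an insulting rant about the recipient's unknown parentage "
    "and lack of intelligence.",
)
for message in messages:
    print(message["role"], ":", message["content"])
```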

Use a document in a prompt

Another useful strategy is to provide a document as part of your prompt and ask the model to rely on it. Some models can look up a web page from its URL; others require you to supply the text. You’ll need to clearly separate your instructions for the model from the document text you want it to use, and, for summarization and entity extraction tasks, specify that the response should depend only on the supplied text.
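Here is a minimal sketch of that separation. The XML-style tags are just one delimiter convention I've picked for illustration (triple quotes or a marker line would do as well), and `build_grounded_prompt` is a hypothetical helper.

```python
def build_grounded_prompt(instructions: str, document: str, tag: str = "document") -> str:
    """Keep the instructions and the supplied document visibly separate."""
    return (
        f"{instructions}\n"
        f"Use only the text inside the <{tag}> tags. "
        "If the answer is not in that text, say so.\n"
        f"<{tag}>\n{document}\n</{tag}>"
    )


prompt = build_grounded_prompt(
    "Summarize the document in one sentence.",
    "The quarterly report shows revenue grew while costs stayed flat.",
)
print(prompt)
```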

Providing a document usually works well if the document is short. If the document is longer than the model’s context window, the tail end of the document won’t be read. That’s one reason that generative AI model developers are constantly increasing their models’ context windows. Gemini 1.5 Pro has a context window of up to 1 million tokens available to a select audience on Google Vertex AI Studio, although currently hoi polloi have to suffer with a “mere” 128K-token context window. As we’ll discuss later, one way to get around context window limits is to use RAG.
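When a document won't fit, the usual first move (and the step that typically precedes RAG) is to split it into pieces that do. The sketch below uses word counts as a crude stand-in for tokens, which is an assumption on my part: real limits are measured with the model's own tokenizer, so treat the budget as approximate. `chunk_by_words` is a hypothetical helper.

```python
from typing import List


def chunk_by_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_by_words("one two three four five", max_words=2)
print(chunks)
```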

If you ask an LLM for a summary of a long document (but not too long for the context window), it can sometimes add “facts” that it thinks it knows from other sources. If you ask instead for the model to compress your document, it is more likely to comply without adding extraneous matter.

Use a chain-of-density prompt

Another way to improve summarization is to use a chain-of-density (CoD) prompt (paper), introduced by a team from Columbia, Salesforce, and MIT in 2023, specifically for GPT-4. A KDnuggets article presents the prompt from the paper in more readable form and adds some explanation. It’s worthwhile to read both the paper and the article.

Short summary: The CoD prompt asks the model to iterate five times on summarization of the base document, increasing the information density at each step. According to the paper, people tended to like the third of the five summaries best. Also note that the prompt given in the paper for GPT-4 may not work properly (or at all) with other models.

Use a chain-of-thought prompt

Chain-of-thought prompting (paper), introduced in 2022, asks the LLM to use a series of intermediate reasoning steps and “significantly improves the ability of large language models to perform complex reasoning.” For example, chain-of-thought prompting works well for arithmetic word problems, which seem to be hard for LLMs to solve correctly even though they are considered elementary-grade math.

In the original paper, the authors incorporated examples of chain-of-thought sequences into few-shot prompts. An Amazon Bedrock example for chain-of-thought prompting manages to elicit multi-step reasoning from the Llama 2 Chat 13B and 70B models with the system instruction, “You are a very intelligent bot with exceptional critical thinking” and the user instruction, “Let’s think step by step.”
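Both flavors, worked examples in a few-shot prompt and the “Let’s think step by step” nudge, are easy to assemble in code. In the sketch below the worked example is one I made up for illustration (it is not taken from the paper or from Bedrock), and `build_cot_prompt` is a hypothetical helper.

```python
from typing import Sequence

# Made-up exemplar that shows the reasoning steps, not just the answer.
WORKED_EXAMPLE = (
    "Q: A shelf holds 4 boxes with 6 pens in each box. 5 pens are removed. "
    "How many pens remain?\n"
    "A: 4 boxes of 6 pens is 4 * 6 = 24 pens. Removing 5 leaves 24 - 5 = 19. "
    "The answer is 19."
)


def build_cot_prompt(question: str,
                     worked_examples: Sequence[str] = (),
                     zero_shot_trigger: bool = True) -> str:
    """Prepend worked examples and optionally add the step-by-step nudge."""
    parts = list(worked_examples)
    answer_lead = "A: Let's think step by step." if zero_shot_trigger else "A:"
    parts.append(f"Q: {question}\n{answer_lead}")
    return "\n\n".join(parts)


prompt = build_cot_prompt(
    "A train has 3 cars with 20 seats each. 12 seats are taken. How many are free?",
    [WORKED_EXAMPLE],
)
print(prompt)
```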

Use a skeleton-of-thought prompt

Skeleton-of-thought prompting (paper), introduced in 2023, reduces the latency of LLMs by “first guide[ing] LLMs to generate the skeleton of the answer, and then conduct[ing] parallel API calls or batched decoding to complete the contents of each skeleton point in parallel.” The code repository associated with the paper recommends using a variant, SoT-R (with RoBERTa router), and calling the LLM (GPT-4, GPT-3.5, or Claude) from Python.
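The control flow is simple enough to sketch without the paper's repository. What follows is my own minimal rendering of the idea, not the authors' code: the prompts are placeholders, `fake_llm` is a stub standing in for a real model call, and the router from SoT-R is left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def skeleton_of_thought(question: str, llm: Callable[[str], str],
                        max_workers: int = 4) -> str:
    """Ask for a skeleton first, then expand every point in parallel."""
    skeleton = llm(
        f"List the key points, one short line each, needed to answer: {question}"
    )
    points = [line.strip("- ").strip() for line in skeleton.splitlines() if line.strip()]

    def expand(point: str) -> str:
        return llm(
            f"Question: {question}\nExpand this point in one or two sentences: {point}"
        )

    # pool.map returns results in the same order as the skeleton points.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(expand, points))
    return "\n".join(f"{i}. {text}" for i, text in enumerate(expansions, 1))


def fake_llm(prompt: str) -> str:
    """Stub model so the sketch runs without any API key."""
    if prompt.startswith("List the key points"):
        return "- define the term\n- give an example"
    point = prompt.rsplit(": ", 1)[1]
    return f"Details about {point}."


print(skeleton_of_thought("What is RAG?", fake_llm))
```

The latency win comes from the expansion calls running concurrently; with a real model, each `expand` call would be a separate API request.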

Prompt engineering may eventually be performed by the model itself. There has already been research in this direction. The key is to provide a quantitative success metric that the model can use.
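To make the role of the metric concrete, here is a toy selection loop, assuming exact-match accuracy on a handful of labeled examples as the success metric. Everything in it is hypothetical: the candidate prompts are hand-written rather than model-generated, and `stub_model` fakes a model that only answers tersely when told to.

```python
from typing import Callable, List, Tuple


def exact_match_score(template: str, model: Callable[[str], str],
                      labeled_examples: List[Tuple[str, str]]) -> float:
    """Fraction of examples where the model's answer equals the expected one."""
    correct = 0
    for question, expected in labeled_examples:
        if model(template.format(question=question)).strip() == expected:
            correct += 1
    return correct / len(labeled_examples)


def pick_best_prompt(candidates: List[str], model: Callable[[str], str],
                     labeled_examples: List[Tuple[str, str]]) -> str:
    """Return the candidate template with the highest score."""
    return max(candidates, key=lambda t: exact_match_score(t, model, labeled_examples))


def stub_model(prompt: str) -> str:
    """Stand-in model: terse only when the prompt asks for just the number."""
    return "4" if "only the number" in prompt else "The answer is 4."


candidates = ["{question}", "{question} Reply with only the number."]
examples = [("What is 2 + 2?", "4")]
best = pick_best_prompt(candidates, stub_model, examples)
print(best)
```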

Step 3: Hyperparameter tuning

LLMs often have hyperparameters that you can set as part of your prompt. Hyperparameter tuning is as much a thing for LLM prompts as it is for training machine learning models. The usual important hyperparameters for LLM prompts are temperature, context window, maximum number of tokens, and stop sequence, but they can vary from model to model.

The temperature controls the randomness of the output. Depending on the model, temperature can range from 0 to 1 or 0 to 2. Higher temperature values ask for more randomness. In some models, 0 means “set the temperature automatically.” In other models, 0 means “no randomness.”
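Under the hood, temperature is commonly applied by dividing the model's raw output scores (logits) by the temperature before turning them into probabilities. The sketch below shows that textbook formulation with made-up logits; I've chosen to treat 0 as “no randomness,” and your model's API may map its temperature setting differently.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature < 0:
        raise ValueError("temperature must not be negative")
    if temperature == 0:
        # Treated here as greedy decoding: all probability on the top token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At the lower temperature the top token takes a larger share of the probability, so sampling is more predictable; at the higher temperature the shares even out and the output gets more varied.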

The context window controls the number of preceding tokens (words or subwords) that the model takes into account for its answer. The maximum number of tokens limits the length of the generated answer. The stop sequence tells the model to stop generating as soon as a specified string appears in the output, which is useful for ending a response cleanly at a known delimiter.

Step 4: Retrieval-augmented generation

Retrieval-augmented generation, or RAG, helps to ground LLMs with specific sources, often sources that weren’t included in the models’ original training. As you might guess, RAG’s three steps are retrieval from a specified source, augmentation of the prompt with the context retrieved from the source, and then generation using the model and the augmented prompt.

RAG procedures often use embedding to limit the length and improve the relevance of the retrieved context. Essentially, an embedding function takes a word or phrase and maps it to a vector of floating point numbers; these are typically stored in a database that supports a vector search index. The retrieval step then uses a semantic similarity search, typically using the cosine of the angle between the query’s embedding and the stored vectors, to find “nearby” information to use in the augmented prompt. Search engines usually do the same thing to find their answers.
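The retrieve, augment, generate flow can be sketched in a few lines. Note the big assumption: the `embed` function below is a toy bag-of-words counter standing in for a real learned embedding model, and a plain Python list stands in for a vector database. Only the cosine-similarity ranking and the prompt augmentation mirror what a real pipeline does; generation would be a call to your model.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def augment(query: str, context: List[str]) -> str:
    """Build the augmented prompt that would be sent to the model."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


DOCS = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Shipping takes five business days.",
]
query = "What is the refund policy?"
top = retrieve(query, DOCS, k=1)
print(augment(query, top))
```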

Step 5: Agents

Agents, aka conversational retrieval agents, expand on the idea of conversational LLMs with some combination of tools, running code, embeddings, and vector stores. Agents often help to specialize LLMs to specific domains and to tailor the output of the LLM. Azure Copilots are usually agents; Google and Amazon use the term “agents.” LangChain and LangSmith simplify building RAG pipelines and agents.
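Stripped of framework machinery, an agent is a loop in which the model either asks for a tool or gives a final answer. The sketch below is a hand-rolled illustration of that loop, not LangChain's API: the action dictionaries, `run_agent`, and the `scripted_model` stub are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]


def run_agent(question: str,
              model: Callable[[History], dict],
              tools: Dict[str, Callable[[str], object]],
              max_steps: int = 5) -> str:
    """Loop until the model returns an answer or the step budget runs out."""
    history: History = [("user", question)]
    for _ in range(max_steps):
        action = model(history)
        if "answer" in action:
            return action["answer"]
        # Otherwise the model asked for a tool: run it and record the result.
        result = tools[action["tool"]](action["input"])
        history.append(("tool", f"{action['tool']} -> {result}"))
    raise RuntimeError("agent did not finish within max_steps")


def scripted_model(history: History) -> dict:
    """Stub model: request a tool on the first turn, then answer from its result."""
    last_role, last_text = history[-1]
    if last_role == "user":
        return {"tool": "word_count", "input": last_text}
    return {"answer": f"Tool result: {last_text}"}


tools = {"word_count": lambda text: len(text.split())}
print(run_agent("count these four words", scripted_model, tools))
```

The step budget matters in practice: without it, a model that keeps requesting tools would loop forever.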

Step 6: Model fine-tuning

Fine-tuning large language models (LLMs) is a supervised learning process that involves adjusting the model's parameters to a specific task. It’s done by training the model on a smaller, task-specific data set that’s labeled with examples relevant to the target task. Fine-tuning often takes hours or days using many server-level GPUs and requires hundreds or thousands of tagged exemplars. It’s still much faster than extended pre-training.

LoRA, or low-rank adaptation, is a method that freezes the pre-trained weights and represents the update to a weight matrix as the product of two much smaller, low-rank matrices, which are the only parameters that get trained. This approximates full supervised fine-tuning in a more parameter-efficient manner. The original Microsoft LoRA paper was published in 2021. A 2023 quantized variation on LoRA, QLoRA, reduces the amount of GPU memory required for the tuning process. LoRA and QLoRA typically reduce the number of tagged exemplars and time required compared to standard fine-tuning.
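A little arithmetic shows where the savings come from. In the LoRA formulation, a frozen weight matrix W of shape d x k gets an update B times A, where B is d x r and A is r x k, with the rank r much smaller than d or k. The sketch below uses plain Python lists so it runs anywhere; the dimensions are illustrative, and I've reduced the paper's scaling factor to a single `scale` argument.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Trainable parameters for full fine-tuning versus a rank-r LoRA update."""
    return d * k, r * (d + k)


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x n) and b (n x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(W: Matrix, B: Matrix, A: Matrix, scale: float = 1.0) -> Matrix:
    """Effective weights: the frozen W plus the scaled low-rank update B @ A."""
    delta = matmul(B, A)
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, full // lora)

# Tiny rank-1 example: W is 2 x 2, B is 2 x 1, A is 1 x 2.
print(apply_lora([[1, 0], [0, 1]], [[1], [2]], [[3, 4]]))
```

For a 4096 x 4096 layer at rank 8, the update has 256 times fewer trainable parameters than the full matrix.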

Step 7: Continued model pre-training

Pre-training is the unsupervised learning process on huge text data sets that teaches LLMs the basics of language and creates a generic base model. Extended or continued pre-training adds unlabeled domain-specific or task-specific data sets to the base model to specialize the model, for example to add a language, add terms for a specialty such as medicine, or add the ability to generate code. Continued pre-training (using unsupervised learning) is often followed by fine-tuning (using supervised learning).

Like everything else associated with machine learning, deep learning, and large language models, the generative AI development process is subject to change, often with little or no notice. Nevertheless, the current process is still aspirational at most companies. Perhaps you now have impetus to change that for your organization.