Salesforce debuts gen AI benchmark for CRM

Salesforce today announced a first-of-its-kind gen AI benchmark for CRM, which aims to help businesses make more informed decisions when choosing large language models (LLMs) for use with business applications.

“Customers don’t just want the best model,” explains Clara Shih, CEO of Salesforce AI. “They want to make sure it’s the best model that’s compliant, fits their security standards, and doesn’t break the bank.”

Choosing an LLM for a business application, Shih says, is a constrained optimization problem: businesses must balance cost, accuracy, trust and safety, and speed.

“Cost and performance optimization means you’re not going to use one model for everything,” she says. “Because cost is an issue, there’s a need to route the right work to the right model.”
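Shih's point about routing can be pictured as a simple constrained selection: for each task, pick the cheapest model that still clears that task's accuracy and latency bar. The sketch below is purely illustrative of that idea; the model names and numbers are invented for the example and are not Salesforce benchmark results.

```python
# Hypothetical illustration of "routing the right work to the right model":
# choose the cheapest model that meets a task's accuracy and latency needs.
# The catalog and all figures below are invented placeholders.

from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    accuracy: float            # benchmark-style accuracy score, 0-1
    cost_per_1k_tokens: float  # USD, illustrative only
    latency_ms: float          # median response time

CATALOG = [
    ModelProfile("small-model", accuracy=0.82, cost_per_1k_tokens=0.0005, latency_ms=300),
    ModelProfile("mid-model",   accuracy=0.90, cost_per_1k_tokens=0.003,  latency_ms=700),
    ModelProfile("large-model", accuracy=0.95, cost_per_1k_tokens=0.03,   latency_ms=1500),
]

def route(min_accuracy: float, max_latency_ms: float) -> ModelProfile:
    """Return the cheapest model that satisfies the task's constraints."""
    eligible = [m for m in CATALOG
                if m.accuracy >= min_accuracy and m.latency_ms <= max_latency_ms]
    if not eligible:
        # Fall back to the most accurate model if nothing meets the bar.
        return max(CATALOG, key=lambda m: m.accuracy)
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)

# A routine case summary can go to a small, cheap model; a task with a
# higher accuracy bar gets routed to a larger one.
print(route(min_accuracy=0.80, max_latency_ms=1000).name)  # small-model
print(route(min_accuracy=0.93, max_latency_ms=2000).name)  # large-model
```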

Enter the new gen AI benchmark. While other benchmarks exist, they tend to be academic and theoretical, focused on general-purpose gen AI with little business relevance. The Salesforce benchmark is intended to help businesses understand the pros and cons of various LLMs as part of their AI technology stack, and make informed decisions that align with their business objectives and priorities.

“When it comes to CRM applications, we want to ensure the qualities of these generative processes are aligned with the CRM goals,” says Silvio Savarese, EVP and chief scientist of Salesforce Research. “The idea is that if a customer has certain needs about use cases, or costs to serve, or latency, they can look at our results, tabular data, and plots and graphs, and they can make an informed decision.”

Furthermore, the benchmark doesn’t rely on automated evaluations based on LLMs or synthetic data. Experienced professionals researched and identified the criteria used to evaluate the LLMs, and the evaluation uses real-world CRM data. Savarese says this approach allows for a comprehensive evaluation of the practical business utility of AI across a wide range of CRM use cases, including sales and service scenarios.

Key metrics

The benchmark, created as a collaboration between Salesforce’s Frontier AI applied research group and the company’s core product and engineering teams, uses human professionals and real CRM data to evaluate LLMs in those four key areas: accuracy, cost, speed, and trust and safety.

  • Accuracy comprises four subcategories: factuality, completeness, conciseness, and instruction-following. Salesforce notes that the more accurate an LLM’s predictions, the more valuable the results are to an organization, and the better the organization can leverage those results to improve the customer experience. It also notes that even if a given model is not accurate enough for a use case, it can be improved through prompt engineering and fine-tuning.
  • Cost categorizes the estimated operational costs of an LLM for various CRM use cases as high, medium, or low. Customers can use this metric to evaluate the cost-effectiveness of LLMs against their budget and resource allocation strategies.
  • Speed assesses responsiveness and efficiency in processing and delivering information. Salesforce notes that faster response times improve the user experience, reduce wait times for customers, and help sales and service teams address inquiries and issues more efficiently.
  • Trust and safety focuses on how the model handles sensitive customer data, adheres to data privacy regulations, and secures information.
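Taken together, the four metrics lend themselves to a simple weighted comparison: a team can weight accuracy, cost, speed, and trust and safety according to its own priorities and rank candidate models accordingly. The snippet below is a generic illustration of that decision logic, not Salesforce's scoring method; the weights and scores are placeholders, not benchmark data.

```python
# Hypothetical example of combining the four benchmark dimensions into a
# single ranking. Weights reflect one team's priorities; scores are
# placeholders normalized to 0-1, where higher is better (so a low-cost
# model gets a high cost score).

WEIGHTS = {"accuracy": 0.4, "cost": 0.2, "speed": 0.2, "trust_safety": 0.2}

candidates = {
    "model_a": {"accuracy": 0.95, "cost": 0.3, "speed": 0.5, "trust_safety": 0.9},
    "model_b": {"accuracy": 0.85, "cost": 0.9, "speed": 0.8, "trust_safety": 0.9},
}

def weighted_score(scores: dict) -> float:
    """Combine per-dimension scores using the team's weights."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

ranked = sorted(candidates.items(), key=lambda kv: weighted_score(kv[1]), reverse=True)
for name, scores in ranked:
    print(f"{name}: {weighted_score(scores):.2f}")

# With these weights, the cheaper, faster model_b (0.86) edges out the more
# accurate model_a (0.72) -- echoing the point that the biggest model isn't
# always the best fit.
```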

“The interesting aspect of this study is that the answer isn’t always the biggest model,” Savarese says. “You can actually obtain very satisfactory performance by using models that are smaller and more effective from the cost and latency perspective.”

He adds that this is only the first iteration of the benchmark.

“This is just the beginning,” he says. “We’re committed to continuing this investigation. We want to expand with more metrics, use cases, data, and more annotations.”

In particular, he says, this first benchmark only covers base models. The team is already working on evaluating how the performance and accuracy of models improve when fine-tuned on CRM data.

“This is where we’re going to see a lot of differentiation coming,” he says.

© Foundry