Leapwork Innovation Lab: Generative AI Locators

Claus Topholt

Leapwork’s Innovation Lab builds experimental technologies and concepts to answer questions about what comes next in testing.

Right now, the biggest question is: What does the future of quality assurance look like, and how will it interact with generative AI?

One of the experimental technologies that we’ve built to help answer this question is a prototype for a fast generative AI locator algorithm that we will dive into shortly.

While it is not likely to become a feature of Leapwork’s Test Automation Platform anytime soon, we believe it points towards a very exciting future for test automation and generative AI.

What is the AI Locator algorithm?

Leapwork has built an algorithm that can find user interface elements on any live web page on-the-fly, using natural language text descriptions such as “username field” or “Salesforce logo.”

Using natural language is a significant improvement over the methodologies previously used to find UI elements in test automation (collectively called “locators”), such as XPath and DOM-based self-healing. It requires no understanding of the underlying technology to operate and is entirely focused on the requirements written by business users.

The following is a quick demonstration of turning descriptive text into actual locators:

[Animation: AI Locator demo on login.salesforce.com]

The example above shows how a user can find any element on the live login.salesforce.com page, using a search box at the bottom of the page that we have built for demonstration purposes.

Look closely:

  • The blue “Log in” button is found with the text description “signin button” which appears nowhere on the page or in the underlying DOM structure.
  • The "Use Custom Domain" link is found using the text description "bring your own domain" which semantically means the same as the link title.
  • The Salesforce logo and the user image / product screenshot, which are both .png images, are found using the text descriptions “salesforce logo” and “user picture.” In fact, “product screenshot” would also have found the latter.

Below, we’ll examine how the experimental generative AI locator algorithm works, along with some background information on the technology that powers it.

What can the algorithm do?

The purpose of the algorithm is to find target elements on live web pages for test automation purposes, using only short descriptive text requirements written by business users.

Business users typically have limited technical insights into how their application was built, but a detailed understanding of the business requirements the application was made to solve. They naturally prefer to describe target elements in their applications based on business knowledge of functionality and content. In other words, “business requirements language.”

The algorithm is focused on ingesting this type of language, particularly for packaged software. It currently works well for Microsoft D365 F&O and Salesforce Lightning.

As an example, the algorithm is capable of finding all elements on Microsoft D365 F&O web pages using only business requirements language:

[Screenshot: a Microsoft D365 F&O web page]

The following are examples of elements that can be accurately found for the page above:

  • “Faves icon”
  • “The name of the record input field”
  • “Openin ofice”
  • “Link to add another lline to the product”
  • “Date of product submission input field”

The algorithm works well with non-structured input, from shorthand (“Faves icon”) to long-winded explanations (“The name of the record input field”) and misspellings (“Openin ofice”).

However, because some applications have many similar elements, it may be necessary for users to specify things like “input field” and “label” to differentiate between them. That being said, there is significant room to further enhance the algorithm and improve the user experience in this area.

What’s our vector, Victor?

The “secret sauce” within the algorithm is an innovation in the use of vector embeddings: one of the most powerful but least understood generative AI technologies.

Embeddings are Large Language Model (LLM)-generated representations of text in a multi-dimensional vector space, which, among other things, can be queried for similarity (vector distance).

It may sound complicated, but there is a way to make sense of vector embeddings without having a PhD in mathematics. You may recall vectors from school as points in (X,Y) space that you can perform math operations on. So for instance, if you have a vector (4,2) and you add (1,2) to it, you get a vector (5,4):

[Diagram: adding the vectors (4,2) and (1,2) in (X,Y) space]
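
Here is the same arithmetic in a few lines of Python, using NumPy:

```python
import numpy as np

# Vectors as points in (X, Y) space, matching the diagram above.
v = np.array([4, 2])
w = np.array([1, 2])

print(v + w)  # [5 4] -- vector addition adds the components
```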

Now, imagine a space where instead of just numbers, vectors could represent words and phrases. We call the vectors in this space vector embeddings, because they capture and “embed” meaningful information, semantic relationships, and contextual characteristics of the data they represent.

In essence, the vector embeddings are positioned in the space based on how closely the data they represent relate to each other.

Imagine a vector of the word King in this space. Then imagine that we subtract Man from that vector, and add Woman to it afterwards. What would the resulting vector be?

The answer is, of course: Queen. It almost seems like magic, but that’s just part of how vector embeddings work.
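
To see the idea in code, here is a minimal sketch in Python. The three-dimensional vectors below are made up purely for illustration; real embeddings come from an LLM and have far more dimensions:

```python
import numpy as np

# Toy 3-D "embeddings" invented for illustration only; real embeddings
# are produced by an LLM and have 1,500+ dimensions.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine_similarity(target, vectors[w]))
print(best)  # queen
```

With real embeddings the same nearest-neighbor search works, just in a much higher-dimensional space.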

Here’s another example: Imagine a vector space that contains vectors for hot dog, cheeseburger, broccoli, and pet. The distance between hot dog and pet would clearly be less than the distance between broccoli and pet, right? And hot dog and cheeseburger would be close together, some distance away from broccoli.

Although the math involved is more complicated than in a two-dimensional (X,Y) space (these spaces typically have over 1,500 dimensions), it’s still somewhat intuitive to think about.

Vector embeddings are incredibly powerful and we have shown that they can solve the problem of finding specific elements on a web page. They are particularly well suited for packaged software such as Microsoft D365 and Salesforce where users share a common business requirements language.

Vector embeddings vs conversational prompts

While vector embeddings and conversational, prompt-based AI are both found in LLMs such as OpenAI’s GPT or Anthropic’s Claude, they are not the same thing.

There are many impressive examples of people using prompt engineering to generate locators, but they rarely (if ever) work in real-life scenarios and are typically very slow to generate, taking 5+ seconds per request. They also incur a significant cost when used in high volumes.

The generative capabilities of GPT-like systems are simply not well-suited to solving the problem of generating locators for packaged software from short descriptive text.

Vector embeddings, on the other hand, are generated by a very different process inside an LLM, one that typically produces more than 1,000 vectors per second and can create robust locators at extremely low cost. We have found that offline LLMs are as good as commercial models from OpenAI and Anthropic at generating these vector embeddings.
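
As an illustration of that point, here is a minimal sketch using a locally run, open-source embedding model. The sentence-transformers library and the all-MiniLM-L6-v2 model are common open-source choices, not necessarily what powers our prototype:

```python
from sentence_transformers import SentenceTransformer, util

# Any locally hosted embedding model will do; this one is a common
# open-source choice, not necessarily the model Leapwork uses.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "signin button"
fragments = ["Log in", "Use Custom Domain", "Forgot your password?"]

q_vec = model.encode(query)
f_vecs = model.encode(fragments)

# Cosine similarity between the query and every fragment.
scores = util.cos_sim(q_vec, f_vecs)[0]
print(fragments[int(scores.argmax())])  # likely "Log in"
```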

It’s worth noting that all generated vectors can be stored locally in a database such as PostgreSQL, where all the math operations to calculate vector similarities are performed. 
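
For example, with the open-source pgvector extension installed, PostgreSQL can compute cosine distances natively. Here is an illustrative sketch in Python; the database, table, and column names are made up, and a real ada-002 query vector would have 1,536 components:

```python
import psycopg2

# Illustrative only: assumes a local PostgreSQL database with the
# open-source pgvector extension and a table of element fragments.
conn = psycopg2.connect("dbname=locators")
cur = conn.cursor()

# In practice this would be the 1,536-float embedding of the query text.
query_vector = "[0.01, -0.02, 0.03]"

# pgvector's <=> operator computes cosine distance inside the database.
cur.execute(
    """
    SELECT content, embedding <=> %s::vector AS distance
    FROM fragments
    ORDER BY distance
    LIMIT 5
    """,
    (query_vector,),
)
for content, distance in cur.fetchall():
    print(content, distance)
```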

In summary, the “secret sauce” in the locator algorithm can stay at home and does not need to be shared with external parties or cloud vendors. This is great news for the many enterprises that need on-premise test automation for a multitude of reasons, including security, compliance, and physical limitations.

How does the locator work?

The following is an overview of the inner workings of the algorithm:

[Diagram: overview of the AI locator algorithm]

In high-level terms, the algorithm performs the following steps:

  1. Collect potentials: All elements in the web page DOM that might be relevant are collected. Elements can be irrelevant (and thus not collected) if they, for instance, are invisible or only contain other elements. We call the collected elements “potentials” because they each represent the potential target of the query text such as “the username field.”
  2. Split into fragments: We split all the potentials (the elements) into fragments, so that text content, referenced URLs, and various other metadata are processed separately.
  3. Assign weights: We then assign different weights to the fragments based on typical relevancy. For instance, the alt text attribute on an image should typically be given a higher weight than the source URL, although the latter should not be disregarded entirely.
  4. Create embeddings: We create vector embeddings for all of the individual fragments using the OpenAI ada-002 LLM, then create hashes for each and store them in our database. The hash values are used to avoid generating embeddings for already known element fragments, as the embedding process is time-consuming and relies on external OpenAI API calls and bulk database inserts. Potentials/fragments that have not been encountered before are added on the fly.
  5. Find vector similarity: We generate a vector for the query and then use the vector math capabilities of our database to find the cosine similarity between the query vector and all the fragment vectors.
  6. Calculate results: We calculate the best matches by re-assembling the query-fragment vector results. The weights of the “potential” fragments are taken into consideration to calculate final, adjusted vector distances for all potential elements before returning a hit to the user.
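
To make the flow concrete, here is a heavily simplified Python sketch of steps 1 through 6. The weights, data shapes, and helper functions are illustrative only, not the prototype’s actual implementation:

```python
import hashlib

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Step 3: illustrative weights per fragment kind (not the prototype's values).
WEIGHTS = {"text": 1.0, "alt": 0.8, "url": 0.3}

embedding_cache = {}  # stands in for the database, keyed by fragment hash


def embed(text):
    """Step 4: embed text with ada-002, skipping fragments seen before."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in embedding_cache:
        resp = client.embeddings.create(
            model="text-embedding-ada-002", input=text
        )
        embedding_cache[key] = np.array(resp.data[0].embedding)
    return embedding_cache[key]


def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


def locate(potentials, query):
    """Steps 2-6. Each potential is a dict like
    {"element": <DOM node>, "fragments": [("text", "Log in"), ("url", "...")]};
    collecting potentials from the live DOM (step 1) is assumed done elsewhere."""
    q_vec = embed(query)  # step 5: one vector for the query text
    best, best_score = None, -1.0
    for p in potentials:
        # Steps 2-3: score each fragment separately, weighted by kind.
        scores = [
            WEIGHTS[kind] * cosine(q_vec, embed(text))
            for kind, text in p["fragments"]
        ]
        # Step 6: re-assemble fragment scores into one score per element.
        score = max(scores, default=-1.0)
        if score > best_score:
            best, best_score = p["element"], score
    return best
```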

This is a robust and fast algorithm, although it may take around 5 seconds to create vector embeddings for all unknown potential fragments on a large web page and bulk insert them into the database. This only needs to happen once per web page URL, and the data can be re-used endlessly. In a sense, it’s possible to “scan” a web page the first time it is ever visited and build up a library of vectors for any packaged software, such as D365 F&O or Salesforce. Subsequent locator actions then take less than 100 ms each.

Drivers for packaged software

The algorithm is not technically tied to any specific type of software and is very capable of finding target elements on all types of web pages without modifications. However, we have found that there are some benefits to building a thin layer of drivers for some types of packaged software.

For instance:

  • In Microsoft D365 BC, some fields are not present in the web page DOM until a certain area is hovered with the mouse pointer. A driver could make these fields available to the algorithm, bridging this somewhat strange gap for users.
  • In Salesforce, some input fields are nested inside deep DOM structures with bespoke HTML5 web elements. The user experience can be significantly improved with a thin driver implementation that navigates these structures easily.

In contrast to previous methodologies used to find elements in test automation, such as XPath and self-healing, the drivers needed for the generative AI locator algorithm are extremely simple to implement.
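
As an illustration of how thin such a driver can be, here is a sketch of the D365 BC hover workaround using Selenium. The URL and CSS selector are hypothetical:

```python
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def reveal_hover_fields(driver, css_selector):
    """Hover over a region so D365 BC adds its hidden fields to the DOM,
    making them visible to the locator algorithm."""
    region = driver.find_element(By.CSS_SELECTOR, css_selector)
    ActionChains(driver).move_to_element(region).perform()


driver = webdriver.Chrome()
driver.get("https://example-d365-bc-instance.invalid/")  # placeholder URL
reveal_hover_fields(driver, ".grid-row")  # hypothetical selector
# ...then hand the now-complete DOM to the locator algorithm.
```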

Final thoughts

As mentioned at the beginning of this article, Leapwork’s Innovation Lab built the generative AI locator algorithm as experimental technology to help answer the questions about what comes next.

Although the algorithm shows a lot of potential in solving an important problem in test automation, there is still a long road ahead. For one thing, the algorithm must be further enhanced, for example with the ability to describe colors, relative positions, and dependencies, and drivers could be created for many different kinds of packaged software.

But far more importantly, the underlying LLM technologies are currently at a very early stage, with accuracy benchmarks for common tasks lying in the 70-85% range. The LLMs must mature, and accuracy must be raised significantly by vendors such as OpenAI and Anthropic, before an algorithm like the one presented here can be hardened enough to be useful in production.

The good news is that, if or when that happens, it seems likely that algorithms such as this one will play an important part in performing autonomous testing of software, using only business requirements language.

And that would be a really satisfying answer to what comes next.