Enhancing Web UI Test Automation with Machine Learning
Grand View Research, “The global automation testing market size was estimated at USD 25.43 billion in 2022 and is expected to grow at a compound annual growth rate (CAGR) of 17.3% from 2023 to 2030”. [1]
My journey in Test Automation started around 1995, when web applications were practically non-existent and desktop software was the norm. UI automation, although simple, was frequently flaky. Why? Object Recognition and Synchronised Actions: these have always been the challenge for robust, repeatable UI automation.
Fast forward to today and web applications are ubiquitous, yet Web UI Test Automation is still a challenge and the reasons have not changed. Web applications built with frameworks like Angular and React have complex lifecycle processes, and web engineers face daunting tasks keeping data and UI in sync. So what hope do test engineers have?
Tools such as Playwright and Cypress have attempted to wrap as many of these pain points as possible. Changing the underlying HTML to create hooks and convenient ways to interact has become standard practice. So what's next? Machine Learning, maybe. In this article I'm going to describe some of the branches of ML that I have used in a practical attempt to solve the problem.
My challenge: how can I log in to any website, anywhere, without using the underlying HTML to interact with an object?
A caveat here: this hypothesis is not about scraping or bot usage. The idea was chosen because of the sheer volume of websites that have variations of user authentication.
The Problem Statement
The typical user authentication on a website will start with either a Login or a Sign In button. In practice, there are also frequently cookies to accept before the user can proceed. The remaining process is well understood, apart from a few outliers.
So now let's break down each step of the process into the branches of Machine Learning and explain why they have been used.
Accept, Sign Up, Sign In and Login are commonly presented as buttons, so they require only a single action: 'Click()'. Input fields require both 'Click()' and 'Type' events. These are the only human interactions we need for the majority of web applications.
In this simple example of user authentication, I have used the following branches of ML.
- Object Detection
- Classification
- Segmentation
- Natural Language Processing / Text Processing
The rationale for using so many ML subfields
When we look at a typical web application, the human brain searches for key indicators. We have become accustomed to looking for certain keywords, usually the name on a button or link, something we should point our mouse at and click. Form fields are usually surrounded by a box, so we are predisposed to locate them by their appearance.
When we convert each of these tasks we can also break them down into user interactions. Object detection and classification are therefore the primary branches, but we also need to process text and to locate or outline certain objects.
Object Detection (Buttons/Links)
The model was built using YOLOv9 and the dataset had no augmentations. The total number of classes was 9, although some classes, such as `Register`, had very little representation.
The dataset used for building the object detection part is linked below. It consisted of 440 website screenshots, captured as 1024 x 640 snapshots and labelled.
https://universe.roboflow.com/flerovium/auth-detection/
Login was the most frequent class, however Login can also appear as `Log in` or `LOGIN`; in fact there are many different fonts and variants.
Object Detection was therefore very good at picking out each class broadly, until the model got confused by similar objects such as `Sign In` and `Sign Out`.
The model metrics show a nice steady curve as the accuracy improves. The number of epochs was 25, which required a lot of GPU processing; I probably could have reduced this to a lower value. On a run of 30 epochs I did see overfitting, but data labelling is not fun and the accuracy was good enough for this exercise.
There is a really nice feature in Roboflow that lets you use your own model as the labelling predictor. Care was taken to keep each bounding box as tight to the object as possible, as this made a really big difference to the model's accuracy.
To resolve the problem of similar objects being detected, I used Single Classification to re-validate the initially predicted class. This meant building a workflow, which Roboflow does very nicely.
The model had very high precision with classes such as Login and Accept. It's entirely possible this was due to both representation and position: the Accept button was frequently found in the lower part of the image and Login in the top-right.
A Dynamic Crop was used to extract the predicted classes. Initially the confidence threshold was set quite high, at around 60%, but I found that the YOLOv9 model did not flood the output with spurious detections at low scores. The Object Detection threshold was therefore set at 10% with a Max Detection of 5.
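To make the workflow concrete, here is a minimal sketch of how the detection step might be called from the test code. It assumes a locally hosted inference endpoint that returns Roboflow-style predictions (centre coordinates plus width and height); the URL, route and field names are illustrative assumptions, not the project's actual API.
// A single detection as assumed above: Roboflow-style boxes report a centre
// point plus width/height rather than top-left coordinates.
type Detection = {
  class: string;       // e.g. "Login", "Accept", "Sign In"
  confidence: number;  // 0..1
  x: number;           // centre x of the bounding box
  y: number;           // centre y of the bounding box
  width: number;
  height: number;
};

// Query the object-detection model for one screenshot and apply the same
// filtering described above: a 10% confidence threshold and a maximum of
// five detections. The endpoint and response shape are assumptions.
async function detectAuthElements(screenshotBase64: string): Promise<Detection[]> {
  const response = await fetch("http://localhost:9001/auth-detection/1", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ image: screenshotBase64 }),
  });
  const { predictions } = (await response.json()) as { predictions: Detection[] };

  return predictions
    .filter((p) => p.confidence >= 0.1)          // Object Detection set at 10%
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, 5);                                // Max Detection of 5
}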
Single Classification (Buttons/Links)
The model was built using YOLOv8n with a dataset of 2,500 images. I used some simple scripting to generate many versions of the images and their variations, working on the assumption that most websites use conventional font styling. For example, 'Log in' was more frequent than 'LOG IN', so I proportioned the generated text in the same way.
I also applied augmentation for this classification model, which grew the dataset to around 6K images.
https://universe.roboflow.com/flerovium/auth-classification/dataset/5
The overall results of the workflow were quite promising. Combining the predictions meant I could reliably re-confirm the class, or build some simple logic to maximise the correct outcome: if Sign In, Sign Up or Sign On was mis-predicted in the object detection phase, the classification almost always predicted the value correctly.
Total failure to detect an object only occurred with major deviations in font or styling; Comic Sans or other really unusual fonts might fall below the 10% object detection threshold.
There was an initial consideration to feed the output data into a third model. Although I wanted to keep the solution zero-code and could have looked at some form of meta-modelling, I realised that for my example simple logic (sketched below) easily resolved the issue, and I was also concerned that continuous model stacking inherently brings in other factors.
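For illustration, here is a minimal sketch of the kind of reconciliation logic described above. The function and field names are hypothetical and simply assume that both models return a label with a confidence score.
// Hypothetical prediction shape shared by the two stacked models.
type Prediction = { label: string; confidence: number };

// Reconcile the broad object-detection class with the crop-level
// classification. When the two disagree on look-alike labels (e.g.
// "Sign In" vs "Sign Out"), prefer the classifier, which only ever
// sees the tightly cropped button text.
function resolveLabel(detection: Prediction, classification: Prediction): string {
  if (detection.label === classification.label) {
    return detection.label;
  }
  // Fall back to the detection class only if the classifier is very unsure.
  return classification.confidence >= 0.5 ? classification.label : detection.label;
}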
So my zero-code approach, which really means zero HTML/CSS browser-object interaction, is starting to look quite neat.
await pageManager.navigateTo("https://www.codecademy.com");
homePage = new HomePage(pageManager.getPage());
await homePage.accept.click();
await homePage.logIn.click();
The example above uses Puppeteer, and the underlying code block for the button element is as follows.
export type UIElementButton = {
  click: () => Promise<void>;
  label: string;
  model: "form-segmentation" | "auth-detection";
};
This all sounds great, however there is still more work to be done: how do I actually click the object? The JSON returned by the model contains the location coordinates, which we can send to Puppeteer. I've marked up the click location before the event is fired.
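As a minimal sketch, and assuming the Roboflow-style detection shape above (centre x/y plus width/height) with the screenshot taken at the current viewport size, the click itself can be little more than driving Puppeteer's mouse to the predicted coordinates; this is illustrative rather than the exact implementation in the repo.
import type { Page } from "puppeteer";

// Click the centre of a detected element using only the coordinates from the
// model's JSON output; no HTML/CSS selectors are involved. Assumes the
// screenshot sent to the model matches the current viewport size.
async function clickDetection(
  page: Page,
  detection: { x: number; y: number; width: number; height: number }
): Promise<void> {
  await page.mouse.click(detection.x, detection.y);
}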
Segmentation (Form Fields)
This model was a YOLOv8s trained on 45 images containing username- and password-related input fields. Form labels invariably come in one of two flavours: either the text or field label is embedded above the input or a placeholder sits within it; sometimes a label appears above with an additional placeholder inside.
Augmentation was applied to increase the footprint of the dataset, using a simple 15% rotation. This expanded the dataset out to 107 images.
Segmentation was surprisingly accurate, often reaching 80% confidence. However, assigning the role of each input box was a challenge: there are many ways to ask for a username, and identifying the correct one required a text-based model.
My workflow initially takes the input screenshot and identifies the segmented boxes. OCR then identifies any text inside each segmentation.
Roboflow has a great feature that allows you to use its in-built models. In this workflow, after the segmentation detection, the Roboflow OCR plugin model was used to extract whatever was inside the bounding area.
The results are pretty amazing: using two models stacked, we can now locate the input field and extract its associated text.
Unfortunately, Roboflow doesn't have the ability to upload custom models into workflows. So at this point we have the location of each input and the text relating to it; we just need to classify which one is the username and which is the password.
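As a sketch, the combined output of the segmentation and OCR steps can be thought of as one record per detected input box together with the text found inside it; the field names below are assumptions rather than the workflow's exact schema.
// Hypothetical combined record produced by the segmentation + OCR workflow.
// This is what the text classifier in the next section consumes.
type SegmentedField = {
  x: number;      // centre x of the segmented input box
  y: number;      // centre y of the segmented input box
  width: number;
  height: number;
  text: string;   // OCR result, e.g. "Email*" or "Password*"
};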
Text Classification (Form Field Validation)
We know from simple LogisticRegression models that classification of text can produce great results.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import joblib

# Training phrases for the two roles we care about.
X = [
    "email", "username", "user id", "please enter username", "email address", "user name",
    "password", "password123", "enter password", "please enter password", "user password"
]
y = ["username", "username", "username", "username", "username", "username",
     "password", "password", "password", "password", "password"]

# Bag-of-words features followed by a simple logistic regression classifier.
vectorizer = CountVectorizer()
X_vect = vectorizer.fit_transform(X)

model = LogisticRegression()
model.fit(X_vect, y)

# Persist both the model and the vectorizer for use at inference time.
joblib.dump(model, "classification_model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")
The results are quite impressive: for the text value 'Email*', the probability of the username class was 71%. Therefore, with a small amount of logic, we can refer back to the original segmentation model output and pick the prediction from which we need to extract the location coordinates.
Predictions received: [
{ label: 'password', probability: 69.06, text: 'Password*' },
{ label: 'username', probability: 71.55, text: 'Email*' }
]
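The "small amount of logic" can then be a simple lookup: pick the highest-probability prediction for the role we want and map it back to its segmented box via the OCR text. This sketch reuses the hypothetical SegmentedField shape from earlier; the names are illustrative.
// Shape of one text-classification prediction, as shown in the output above.
type FieldPrediction = { label: string; probability: number; text: string };

// Find the best segmented box for a given role ("username" or "password")
// by matching the strongest prediction back to its OCR text.
function locateField(
  role: "username" | "password",
  predictions: FieldPrediction[],
  fields: SegmentedField[]
): SegmentedField | undefined {
  const best = predictions
    .filter((p) => p.label === role)
    .sort((a, b) => b.probability - a.probability)[0];
  if (!best) return undefined;
  return fields.find((f) => f.text === best.text);
}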
And finally the code for Puppeteer to interact with the Input Element.
export type UIElementInput = {
  click: () => Promise<void>;
  fill: (keys: string) => Promise<void>;
  label: string;
  model: "form-segmentation" | "auth-detection";
};
await pageManager.navigateTo("https://www.codecademy.com");
homePage = new HomePage(pageManager.getPage());
// Accept/Login
await homePage.username.fill("username@example.com");
await homePage.password.fill("Password123!");
await homePage.login.click();
Once the input field is identified, Puppeteer will click the input element and fire in the keyboard strokes.
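Under the hood, a fill() of this kind could be little more than a coordinate click to focus the field followed by Puppeteer's keyboard API; this is a sketch of one possible implementation, not necessarily the repo's.
import type { Page } from "puppeteer";

// One possible fill(): click the predicted centre of the input box to focus
// it, then send the keystrokes with a small delay to mimic a real user.
async function fillField(
  page: Page,
  field: { x: number; y: number },
  keys: string
): Promise<void> {
  await page.mouse.click(field.x, field.y);      // focus without any selector
  await page.keyboard.type(keys, { delay: 50 }); // fire in the keyboard strokes
}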
Conclusion
The Test Automation market is huge and the entry point for engineers is always shifting as the underlying web technology changes. It's also apparent that Web UI testing is expensive, and not just in maintenance. Costs can be reduced using more robust techniques such as component mocking, which lowers the quantity of UI tests required but will never eliminate them.
The two pain factors of test automation mentioned earlier, object recognition and synchronisation, did not magically go away with this approach. I still found myself waiting to interact with an object, or wondering at what point I should take a screenshot. In fact, I collected a whole table of issues, and that's not all of them.
However, this was just one approach, and I don't see any reason why building custom, site-related models cannot be figured out in the future. Extracting form fields was surprisingly accurate, and the simplicity of the interaction events, just 'click' and 'type', made them very easy to work with.
What I did not cover was extracting information from the webpage. I've had previous experience using classification techniques with a Playwright library, https://www.npmjs.com/package/playwright-classification. Again, this would require a lot of thought about which images can be classified.
I can also see how extracting tables would be a consideration; another library, https://www.npmjs.com/package/playwright-tables, can be used to build data frames that provide test expectations.
There is also the possibility of getting a Transformer architecture, like ChatGPT, to sit over the web-interaction models. The following statement is probably a test found in some form many thousands of times over.
Given I am Bob Sanders with valid credentials
When I authenticate on the system
Then I should be able to see my account details
The amount of code is also quite slim and can be found here: https://github.com/serialbandicoot/puppeteer-metal. [2]
So did my approach solve the problem even if still a tad flaky?
I don't think it will be too long before we see start-ups and behemoths like Playwright getting in on the act, because eventually this becomes a data science problem: the repetitive, boring and yet incredibly important tasks in Web UI Automation Testing, such as visual regression, are still vital.
I leave you with a video of the full demo in action and please notice in the terminal the prediction values and traffic coming from the models. Surely this is the future.
About the Author:
Sam Treweek has been a QA Engineer for over 20 years, gaining experience across a wide range of industries. Machine Learning has always been a source of fascination for him, and after taking a break from his Master’s in Computer Science, he recognized that ML applications were finally making a tangible impact on his clients. Visual Regression has become a particular area of focus for Sam, and during his work on a Space Weather project, he posed the question: how else can an SDO AIA-0335 image be tested when it’s constantly changing?
[1] Grand View Research. Automation Testing Market Size, Share, and Trends Analysis Report by Testing Type, by Service, by Endpoint Interface, by Enterprise Size, by Vertical, and Segment Forecasts, 2023–2030. Retrieved December 4, 2024, from https://www.grandviewresearch.com/industry-analysis/automation-testing-market-report.
[2] You will need to create a Roboflow account; details are found in the README if you want to run the full program. This will also require Docker to run the Roboflow local server. Reach out via the repo Issues for more details.