Using Machine Learning for Web Table data extraction
This article describes how to extract data from web tables, and how conventional browser-testing techniques and Machine Learning are converging.
In my role as a Test Engineer on a very science-heavy set of products, I often need to check data presented in web tables. Anyone with experience of getting data out of tables will know that "tricky" is a fairly accurate description.
It’s not clear when the W3C groups decided on the table structure, but what is clear is that no one really follows the same pattern. A well-defined table, in my opinion, has a separated header, otherwise known as a `thead`; let’s not worry about the rest of the story. It’s actually quite obvious to me.
Unless, of course, you’re a Web UI Engineer, in which case, although you should use this pattern, the attitude is often “well hey, if it looks right, why should I care!”
Unfortunately this means Test Engineers like me end up with gnarly selectors, and different devs will code tables in their own unique way. Or they’re using an out-of-the-box library, in which case they really don’t care.
DataFrames — what are these magical structures?
It’s time for me to introduce DataFrames, and luckily we can use ChatGPT to produce a lovely, “completely unverified!” paraphrase.
A DataFrame is a two-dimensional, tabular data structure commonly used in data analysis and manipulation. It’s a core data structure in Python’s pandas library and other data manipulation frameworks like R, Apache Spark, and Julia. Think of it as an in-memory spreadsheet or SQL table.
Dealing with data in-memory: now we’re talking. This means I can create assertions against a data object. There’s a very handy library here, if I don’t mind a bit of self-publicising.
So how do we go from a web-page like this!
To an object with a two-dimensional, spreadsheet-like structure, i.e. a DataFrame, like this!
[
  {"Company": "Alfreds Futterkiste", "Contact": "Maria Anders", "Country": "Germany"},
  {"Company": "Centro comercial Moctezuma", "Contact": "Francisco Chang", "Country": "Mexico"}
]
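To make that jump concrete, here is a minimal sketch (assuming pandas is installed) of loading those records into a DataFrame and asserting against it in memory, with no selectors in sight:

```python
import pandas as pd

# The same records as above, one dict per table row
records = [
    {"Company": "Alfreds Futterkiste", "Contact": "Maria Anders", "Country": "Germany"},
    {"Company": "Centro comercial Moctezuma", "Contact": "Francisco Chang", "Country": "Mexico"},
]

df = pd.DataFrame(records)

# Assertions against the in-memory object, not the DOM
assert list(df.columns) == ["Company", "Contact", "Country"]
assert df.loc[df["Country"] == "Mexico", "Contact"].item() == "Francisco Chang"
```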
The Good, The Bad, The Ugly
In my experiment there are three different approaches that I’m going to discuss. None of them are really ugly or good, although certain approaches could lead to spaghetti code 🤠, because tables are inherently built on chaos. See my previous dig at Web Devs.
Pure Coding
I’ll be brief: there are many examples out there and, as mentioned, a library like playwright-tables is designed around a well-defined table. After spending many years trying to create the perfect one-stop solution I’m still searching, so we can move on.
Hybrid — A mixture of ML and Code
In a previous article (https://samtreweek.medium.com/enhancing-web-ui-test-automation-with-machine-learning-096d6bba44f2) I discussed ML techniques to recognise certain fields or classes of objects when authenticating.
Well, here we are again: we can use segmentation to learn what a table is and where it sits on the web page.
After searching the web for examples of HTML tables, I took screenshots and marked up each table using segmentation. The dataset contained 43 images, which rotation augmentation expanded to 99. The model was built with YOLOv8s, and the accuracy is pretty high.
I say high rather than quoting a number because, in my hybrid solution, I only need an X,Y coordinate inside the table. In fact, as long as the centre of my segmented prediction lands inside the table, it’s good enough, because I’m then going to delegate the rest back to ol’ fashioned UI test code.
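That "good-enough" check can be sketched in a few lines of Python; the function names here are my own illustration, not code from the repo:

```python
def center(box):
    """Center (x, y) of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def center_inside(pred_box, table_box):
    """True if the predicted segment's centre falls inside the real table box."""
    cx, cy = center(pred_box)
    x1, y1, x2, y2 = table_box
    return x1 <= cx <= x2 and y1 <= cy <= y2

# A sloppy prediction is still usable as long as its centre lands in the table
assert center_inside((90, 40, 210, 160), (100, 50, 400, 300))
assert not center_inside((0, 0, 40, 40), (100, 50, 400, 300))
```

Everything downstream only needs that one point, which is why the model's exact pixel accuracy barely matters here.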
export type UITableElement = {
getTable: () => Promise<TableData>;
label: string;
model: string;
}
test('should extract table', async () => {
await pageManager.navigateTo("https://www.w3schools.com/html/html_tables.asp");
homePage = new HomePage(pageManager.getPage());
const table = await homePage.table.getTable();
console.log(table);
});
The snippet uses the table we saw earlier from w3schools, and from the prediction we use a handy document method, elementFromPoint.
Under the hood, the playwright-tables library is actually just grabbing the outerHTML of the table and passing it on to another library, html-table-to-dataframe.
function findTableAt(x, y) {
  const element = document.elementFromPoint(x, y);
  // Navigate up the DOM tree to check if it's inside a table
  let currentElement = element;
  while (currentElement) {
    if (currentElement.tagName === 'TABLE') {
      const selector = currentElement.id
        ? `#${currentElement.id}`
        : currentElement.className
          ? `.${currentElement.className.split(' ').join('.')}`
          : 'table';
      // Get the outerHTML of the table
      const outerHTML = currentElement.outerHTML;
      return { insideTable: true, locator: selector, outerHTML };
    }
    currentElement = currentElement.parentElement;
  }
  // The predicted point wasn't inside a table after all
  return { insideTable: false };
}
In this hybrid example, the code does exactly the same thing: it navigates up the parent stack until it sees TABLE. Once identified, it’s a simple case of packaging up the outerHTML and sending it over to that library, and hey presto, we get a DataFrame back. 🚀
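As a sketch of what that last step amounts to, here is the same outerHTML-to-records idea in plain Python using only the standard library. This is my own illustration, not the actual code of html-table-to-dataframe:

```python
from html.parser import HTMLParser

class TableToRecords(HTMLParser):
    """Naive sketch: turn a simple <table> outerHTML into a list of dicts,
    using the <th> cells as column names (assumes a well-formed table)."""
    def __init__(self):
        super().__init__()
        self.headers, self.rows, self.row, self.cell, self.in_cell = [], [], [], "", None

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self.in_cell, self.cell = tag, ""
        elif tag == "tr":
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell += data.strip()

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            (self.headers if tag == "th" else self.row).append(self.cell)
            self.in_cell = None
        elif tag == "tr" and self.row:
            self.rows.append(dict(zip(self.headers, self.row)))

sample = """<table><tr><th>Company</th><th>Country</th></tr>
<tr><td>Alfreds Futterkiste</td><td>Germany</td></tr></table>"""
p = TableToRecords()
p.feed(sample)
assert p.rows == [{"Company": "Alfreds Futterkiste", "Country": "Germany"}]
```

Notice the assumption baked in: the column names come from wherever the th cells are, which is exactly where things go wrong next.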
Here is the video of the Hybrid solution working in harmony!
But wait, why does the data object look so bad? Why doesn’t it look like a perfect data structure with “Laughing Bacchus Winecellars”? 🤷‍♂️
[
{undefined: 'Country'}
{undefined: 'Germany'}
{undefined: 'Mexico'}
{undefined: 'Austria'}
{undefined: 'UK'}
{undefined: 'Canada'}
{undefined: 'Italy'}
]
You see, even w3schools has fallen into the trap: they didn’t put the header into a thead, they’ve put the th cells straight into the tbody. 😿
<table>
<tbody>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
</tbody>
</table>
I was going to contact the maintainer and raise the obvious defect, although, of course, a th inside a tbody is actually valid. Thanks, W3!
Machine Learning Only (mostly)
So far I’ve only touched Machine Learning for simple identification. Enter the Table Transformer (TATR) [2]; it’s even got an acronym. Thanks, Microsoft, for this bit of heavyweight research: this is surely going to be good.
My plan is to leverage the TATR model to convert the table to a data-object that I can work with.
But first I have a dilemma: how to get the HTML table to TATR. I know that my simple segmentation table model is pretty good at finding a table, and I also know that my table-search code can get the table’s outerHTML. I’m also aware that my screenshot might only cover a section of the table, not all of it.
What if I used my initial segmentation model to find the table and construct a new image without the rest of the webpage, passing this isolated table directly to TATR? That could definitely improve the results.
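Cropping the page screenshot down to the predicted table box is only a few lines with Pillow. This is a sketch under the assumption that the model returns pixel coordinates as (x1, y1, x2, y2); the function name is mine:

```python
from PIL import Image

def isolate_table(screenshot_path, box, out_path="table_only.png"):
    """Crop the full-page screenshot to the segmented table bounding box
    (x1, y1, x2, y2) so only the table is sent on to TATR."""
    with Image.open(screenshot_path) as img:
        table = img.crop(box)
        table.save(out_path)
    return out_path
```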
For now, though, I’m going to stick to a simple screenshot and pass that to TATR. In theory the cropped image would make the results more robust, but this is a simple R&D attempt. So let’s get started.
The code
I’ve used the following notebook as a reference to implement my inference layer, with ChatGPT 🤖 and a simple prompt: “As a developer, convert all of this into a single Python class”. You can see the result in the repo [1].
Using the screenshot I already had of the w3schools table, I used this as my test image to send to my new TableAnalyzer Python class.
The results are quite intriguing. The most obvious error was in the country Germany, with the OCR mistaking a “y” for an “i”. The ~12s it takes to run the prediction also struck me as quite high, which makes me think that larger tables, both horizontally and vertically, could have a material impact on speed. Perhaps something to consider when running UI tests.
(3.11.5) sam.treweek@MAC-D27NFCJ397 roboflow-server % time python tatr.py
100%|█████████████████████████████████████████████████████████████████████████████| 6/6 [00:01<00:00, 3.64it/s]
0 1 2
0 company contact country
1 Alfreds Futterkiste Maria Anders Germani
2 Centro comercial Moctezuma Francisco Chang Mexico
3 Ernst Handel Roland Mendel Austria
4 Island Trading Helen Bennett UK
5 Lauahina Bacchus Winecellars Yoshi Tannamuri Canada
python tatr.py 17.57s user 11.69s system 206% cpu 14.174 total
Hooking it up
After adding the new TableAnalyzer class to my Flask server and updating my test code, I now have the following.
export type UITableElement = {
getTableFromXY: () => Promise<TableData>;
getTableFromTATR: () => Promise<string>;
label: string;
model: string;
}
test('should extract table data', async () => {
pageManager = new PageManager();
await pageManager.initialize();
await pageManager.navigateTo("https://www.w3schools.com/html/html_tables.asp");
homePage = new HomePage(pageManager.getPage());
await homePage.accept.click();
const tableFromXy = await homePage.table.getTableFromXY();
console.log(tableFromXy);
const tableFromTatr = await homePage.table.getTableFromTATR();
console.log(tableFromTatr);
await pageManager.close();
});
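On the server side, the plumbing is roughly this shape. The `/table` route and the stub `TableAnalyzer` below are illustrative, not the exact repo code; the real class wraps the TATR inference:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

class TableAnalyzer:
    """Stub standing in for the real TATR-backed analyzer:
    takes screenshot bytes, returns table rows as a list of dicts."""
    def analyze(self, image_bytes):
        # Real implementation: run TATR structure recognition + OCR here
        return [{"Company": "Alfreds Futterkiste", "Country": "Germany"}]

analyzer = TableAnalyzer()

@app.route("/table", methods=["POST"])
def table():
    # The Playwright test POSTs the screenshot; we hand it to the analyzer
    rows = analyzer.analyze(request.get_data())
    return jsonify(rows)
```

The TypeScript side then just POSTs the screenshot and deserialises the JSON rows, which is what `getTableFromTATR` is doing.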
Conclusion
I’m actually quite surprised the results were so good, if varied. The overall speed of TATR and the limited ability of the OCR library are potential weaknesses, although the video above shows it working quite quickly. I would therefore have a few niggles about using the pure-ML method of table-data extraction to write test assertions against. There was also more coding needed to shape the image to present to TATR, hence the need for a Flask server to handle all the Python code.
The hybrid method, even though it was great at finding the table, was then reliant on the HTML structure of the table. This could certainly be fixed up to a point, but it won’t solve all the issues, because there is the potential to play whack-a-mole with all the variations of tables.
By far the most interesting observation for me is the concept of leveraging external models that are designed to do one job well, then building bespoke workflows to mix in the desired results. This, for me, is the future.
Roboflow has a great offering for building out workflows; however, it’s limited. I could not, for example, load my TATR code without using their bespoke solution.
Perhaps I could have improved the presentation of the table to TATR: since I can get the underlying table HTML, I could in theory go one step further and simply increase the table size, modify the font or make CSS changes until I had created a very simple but effective table design.
Therefore I think the answer is, “Yes, Machine Learning can be used for data extraction”, but it will be a hybrid solution. What I don’t see yet is consistency between model outputs. ONNX (the Open Neural Network Exchange) has set an open standard so that models can be exported into a common format. This would certainly be a promising way of building complex ML workflows, with the potential to switch out models when the need arises.
If I were a test-automation specialist like SmartBear with TestComplete, or a similar commercial integrated test solution, I could imagine these models working quite well, as you have a more controlled system in which to integrate a model like TATR, and you can upgrade it as desired.
However, when building your own, there really is too much fluff to resolve before you can leverage the benefits. So, just like many ML projects I’ve seen: promising, but not yet ready for us open-source folk.
About the Author:
Sam Treweek has been a QA Engineer for over 20 years, gaining experience across a wide range of industries. Machine Learning has always been a source of fascination for him, and after taking a break from his Master’s in Computer Science, he recognized that ML applications were finally making a tangible impact on his clients. Visual Regression has become a particular area of focus for Sam, and during his work on a Space Weather project, he posed the question: how else can an SDO AIA-0335 image be tested when it’s constantly changing?
[1] https://github.com/serialbandicoot/puppeteer-metal
[2] https://github.com/microsoft/table-transformer
@software{smock2021tabletransformer,
author = {Smock, Brandon and Pesala, Rohith},
month = {06},
title = {{Table Transformer}},
url = {https://github.com/microsoft/table-transformer},
version = {1.0.0},
year = {2021}
}