Cleaning up HTML to Text for AI

I realize this is a much more technical post than I usually write here, but for those following along on my AI journey, I wanted to share a huge time (and space) saver with you.
Chances are you are connected to KM in some way and understand the challenges of data formats, file types and all the other things that make ingesting your existing data that much harder. First, let me give you a bit of background, and then I will share the script and how to use it.
Notepad++ and Python Plugin

Personal Knowledge Management (PKM)

My testing of AI is all about learning how to use it for expertise capture and decision support, while overcoming the problems of distance, time and succession planning for the experts in my clients' organizations.

As a test bed, I am using my own 20+ years of Knowledge Management experience. I am testing different methods of expertise capture (more on that in future posts), but underlying it all is a foundation of key papers, blog posts, notes from conferences, meetings, projects and dialogues with other KM experts. I manage all that information in Evernote, and publish my thoughts right here on my blog, www.DeltaKnowledge.net.

Through the various iterations of the different systems I am building and testing, by far the greatest amount of work has been cleaning up, categorizing and converting all that data to text.

Many of the new tools have agents that will scrape my blog into a vector database, but I have unfortunately found these quite unreliable, and it is very hard to gauge their progress and completeness. Blogger lets me export all my posts to XML, and having 17 years of them in a single document for RAG gives me far more control and makes setting up each new system a breeze.

Evernote has a few more restrictions, which has meant grouping documents into lots of 300 for export (a problem, since I have 11,000 notes in total and over 4,000 just on KM). However, once they are exported as HTML files, the work is largely done.

Converting XML and HTML to text

OK, now to the useful part. Both the Blogger XML files and the Evernote HTML files are extremely verbose: full of formatting, scripts, tags and other information that an LLM has no idea what to do with. In my early experiments, these caused havoc, as each chunk in the vector database would often just catch the beginning or end of a paragraph and be largely useless. This severely degraded both chunk selection and response performance (see this short video if these concepts are new to you).
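To make that concrete, here is a rough sketch of my own (not part of the actual pipeline) showing how a naive fixed-size chunker, similar in spirit to what a vector database ingester does, fares on raw markup versus cleaned text:

```python
import re

def chunk(text, size=100):
    """Naive fixed-size chunker, splitting the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

raw = ('<div class="para" style="font-size:14px"><span class="body">'
       'Knowledge capture starts with the expert.</span></div>')

# Strip every tag, leaving only the sentence itself.
clean = re.sub(r'<[^>]+>', '', raw)

# The raw version splits into chunks that are mostly markup, with the
# sentence cut partway through; the cleaned version fits in one chunk.
raw_chunks = chunk(raw)
clean_chunks = chunk(clean)
```

The markup more than doubles the character count here, which is exactly why chunks end up catching only a fragment of the actual prose.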

So now, rather than managing hundreds or thousands of files, I am just managing a few really large ones.

Cleaning these up to plain text had been quite painful, so I told ChatGPT I was using Notepad++ and asked how I could use regex to clean these files up. It suggested the PythonScript plugin for Notepad++ and I was away!

I iteratively solved each of the problems in the files including capturing the metadata inside the tags before deleting them and clearly separating the individual files so the embedder chunks and dimensions them correctly. Below is the python code in its current state after probably 30 subtle iterations.

If you are using Notepad++, just download the PythonScript plugin, then save this as "HTML_cleanup.py" in the plugin's scripts folder, and you can run it from the menu as shown above. On my PC, the folder is "C:\Program Files\Notepad++\plugins\PythonScript\scripts".

Obviously this cleans up any HTML, but it is specifically built for Evernote and Blogger exports. If you are using other data sources, just copy this into ChatGPT along with a chunk of your raw data and ask it to update the Python to match your source. I hope it saves you as much time as it is saving me. It will process a 40MB HTML file in around 8 seconds and reduce it to half, and sometimes a third, of its original size. Brilliant.

import re

# Wrap the whole cleanup in a single undo action on the active document
editor.beginUndoAction()

# Step 1: Manually replace HTML entities
content = editor.getText()
content = content.replace('&lt;', '<')
content = content.replace('&gt;', '>')
content = content.replace('&amp;', '&')
content = content.replace('&quot;', '"')
content = content.replace('&#39;', "'")
editor.setText(content)

# Step 2: Remove the specific section at the beginning
editor.rereplace(r'^/\*![\s\S]*?\*/[\s\S]*?\*/', '')

# Step 3: Remove the entire <head> section
editor.rereplace(r'<head[\s\S]*?>[\s\S]*?</head>', '')

# Step 4: Replace the initial line with instructions
editor.rereplace(r'<!DOCTYPE html>', 'This file is an archive of separate articles. Each article begins with a meta section including title, tags, source-url, etc. Each article contents should be categorised by the meta information preceding it.')

# Step 5: Remove </span> tags on empty lines
editor.rereplace(r'^\s*</span\s*>\n', '')

# Step 6: Remove <span> tags on empty lines
editor.rereplace(r'^\s*<span\s*>\n', '')

# Step 7: Remove empty <span> tags
editor.rereplace(r'<span\s*>', '')

# Step 8: Remove empty </span> tags
editor.rereplace(r'</span\s*>', '')

# Step 9: Remove closing </div> tags
editor.rereplace(r'</div>', '')

# Step 10: Remove </div> tags on empty lines
editor.rereplace(r'^\s*</div\s*>\n', '')

# Step 11: Remove opening <span> tags with any class attribute but keep the contents
editor.rereplace(r'<span\s+class="[^"]*">', '')

# Step 12: Remove opening <div> tags with any class attribute but keep the contents
editor.rereplace(r'<div\s+class="[^"]*">', '')

# Step 13: Remove <div> tags with any attribute but keep the contents
editor.rereplace(r'<div[^>]*>', '')

# Step 14: Remove <span> tags with any attribute but keep the contents
editor.rereplace(r'<span[^>]*>', '')

# Step 15: Remove <input> tags with any attribute but keep the contents
editor.rereplace(r'<input[^>]*>', '')

# Step 16: Remove <ul> tags with any attribute but keep the tag
editor.rereplace(r'<ul[^>]*>', '<ul>')

# Step 17: Remove empty anchor tags
editor.rereplace(r'<a\s+[^>]*></a>', '')

# Step 18: Remove suspect image files
editor.rereplace(r'<img\s+src="[^"]*Evernote\s*\(\d+\)[^"]*">', '')

# Step 19: Remove <div> tags on empty lines
editor.rereplace(r'^\s*<div\s*>\n', '')

# Step 20: Remove inline CSS styles
editor.rereplace(r'style="[^"]*"', '')

# Step 21: Remove HTML comments
editor.rereplace(r'<!--[\s\S]*?-->', '')

# Step 22: Remove <svg> tags and their content
editor.rereplace(r'<svg[\s\S]*?</svg>', '')

# Step 23: Remove <symbol> tags and their content
editor.rereplace(r'<symbol[\s\S]*?</symbol>', '')

# Step 24: Remove empty <img> tags without src attribute
editor.rereplace(r'<img(?![^>]*\bsrc\b)[^>]*?>', '')

# Step 25: Remove hash attributes from <img> tags
editor.rereplace(r'hash="[^"]*"', '')

# Step 26: Remove attributes from <p> tags
editor.rereplace(r'<p[^>]*>', '<p>')

# Step 27: Replace &nbsp; with a single space
content = editor.getText()
content = content.replace('&nbsp;', ' ')
editor.setText(content)

# Step 28: Replace <br> tags (any variant: <br>, <br/>, <br >) with a single space
editor.rereplace(r'<br\s*/?\s*>', ' ')

# Step 29: Replace lines containing only spaces or tabs with a single space
editor.rereplace(r'^[ \t]+$', ' ')

# Step 30: Replace double newlines with a single newline multiple times
for _ in range(10):  # Run multiple times to ensure all double newlines are replaced
    content = editor.getText()
    content = content.replace('\n\n', '\n')
    editor.setText(content)

# Step 31: Add multiple newlines before each article
editor.rereplace(r'<meta itemprop="title"', '\n\n\n<meta itemprop="title"')

# Step 32: Extract metadata to key/value pairs
editor.rereplace(r'<meta itemprop="title" content="([^"]*)">', r'\n\n\n---------< New Article >----------\nTitle: \1\n')
editor.rereplace(r'<meta itemprop="tags" content="([^"]*)">', r'Tags: \1\n')
editor.rereplace(r'<meta itemprop="source-url" content="([^"]*)">', r'Source URL: \1\n')

# Step 33: Remove all remaining HTML tags
editor.rereplace(r'<[^>]+>', '')

# Step 34: Remove empty lines and lines with only spaces
editor.rereplace(r'\n\s*\n', '\n')

# End the undo action
editor.endUndoAction()
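If you are not a Notepad++ user, the same approach can be run as a standalone Python script over a file. This is only a condensed sketch of the steps above (the same ideas, not the full 34-step pipeline), and the file-handling wrapper is my own assumption rather than part of the plugin script:

```python
import re
import sys

def clean_html(text):
    """Condensed version of the Notepad++ cleanup for running outside the editor."""
    # Step 1 equivalent: decode the common HTML entities first.
    for ent, ch in (('&lt;', '<'), ('&gt;', '>'), ('&amp;', '&'),
                    ('&quot;', '"'), ('&#39;', "'"), ('&nbsp;', ' ')):
        text = text.replace(ent, ch)
    # Remove comments, the <head> section, and <svg>/<symbol> blocks with their content.
    text = re.sub(r'<!--[\s\S]*?-->', '', text)
    text = re.sub(r'<head[\s\S]*?</head>', '', text)
    text = re.sub(r'<(svg|symbol)[\s\S]*?</\1>', '', text)
    # Pull the Evernote metadata out BEFORE any tags are deleted.
    text = re.sub(r'<meta itemprop="title" content="([^"]*)">',
                  r'\n\n---------< New Article >----------\nTitle: \1\n', text)
    text = re.sub(r'<meta itemprop="tags" content="([^"]*)">', r'Tags: \1\n', text)
    text = re.sub(r'<meta itemprop="source-url" content="([^"]*)">',
                  r'Source URL: \1\n', text)
    # Drop every remaining tag, then collapse runs of blank lines.
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\n\s*\n+', '\n', text)
    return text

if __name__ == '__main__' and len(sys.argv) > 1:
    src = sys.argv[1]
    with open(src, encoding='utf-8') as f:
        cleaned = clean_html(f.read())
    with open(src + '.txt', 'w', encoding='utf-8') as f:
        f.write(cleaned)
```

Run it as `python HTML_cleanup_standalone.py myexport.html` and it writes `myexport.html.txt` next to the original, which conveniently leaves the raw HTML file untouched as your backup.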

Just one last thing for those who wish to modify the script: the order of some of these steps is quite important. For example, if you remove the HTML tags before you extract the Title, Tags, etc., you will lose that data. Test, test, test! And when you first test it, make sure you keep a backup copy of your raw files (i.e. keep the original HTML files and save the cleaned output as .TXT).
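To see why the order matters, here is a tiny illustration of my own (using plain `re` rather than the editor object):

```python
import re

sample = '<meta itemprop="title" content="KM Basics"><p>Body text</p>'

# Wrong order: stripping all tags first destroys the metadata.
wrong = re.sub(r'<[^>]+>', '', sample)
# wrong is now just 'Body text' -- the title is gone for good.

# Right order: extract the title first, then strip the remaining tags.
right = re.sub(r'<meta itemprop="title" content="([^"]*)">', r'Title: \1\n', sample)
right = re.sub(r'<[^>]+>', '', right)
# right keeps 'Title: KM Basics' on its own line, followed by 'Body text'.
```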

All the best!
