Regex Example - Strip Out HTML Tags

First and foremost, HTML is not regex friendly. You should not try to parse HTML in PowerShell, or with regular expressions in general, unless you’ve lost some kind of bet or want to punish yourself for something. For generating HTML, PowerShell has cmdlets like ConvertTo-Html that make that kind of thing way less migraine inducing.

That said, I recently had a situation where I just wanted to strip all the HTML tags out of a string. My input looked something like this (assigned to a variable $html).

<html>
<body>
<p>This is an important value</p>
</body>
</html>

All I want is the “This is an important value” part, so this seemed like a place where the “don’t use regex on HTML” rule could be broken. It’s even a pretty simple regex.

$html -replace '(<\/*\w+?>){1,}'

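Because no replacement string is given, -replace swaps each match for an empty string. You’ll have to wrap the whole expression in round brackets and call .Trim() on the result to clean up the leftover white space, but this will work for the “get rid of the HTML” goal. Let’s break down this regular expression to see what it’s doing.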
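To make the wrapping step concrete, here is a minimal sketch of the whole thing end to end. It assumes the sample input is held in a single-quoted here-string, and $text is just a name picked for this illustration.

```powershell
# Sample input from above, assigned to $html as a single string
$html = @'
<html>
<body>
<p>This is an important value</p>
</body>
</html>
'@

# The round brackets make -replace run first, so .Trim() is called on its result
$text = ($html -replace '(<\/*\w+?>){1,}').Trim()

$text   # This is an important value
```

Without the round brackets, .Trim() would be applied to the pattern string rather than to the output of -replace.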

Starting on the far right side, the {1,} specifies “one or more” of the pattern that precedes it (the same thing as +), in this case the rest of the expression wrapped in round brackets. Inside those round brackets is a pattern which states “an opening angle bracket (<), zero or more forward slashes escaped by a backslash (\/*), then one or more word characters, meaning letters, digits or underscores, matched lazily (\w+?) up to a closing angle bracket (>)”. It just rolls off the tongue, right?
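If you want to poke at the individual pieces, the -match operator is a quick way to do it. The test strings below are made up for illustration, and $pattern is just a convenience variable.

```powershell
$pattern = '(<\/*\w+?>){1,}'

'<p>'  -match $pattern    # True: opening tag
'</p>' -match $pattern    # True: the \/* picks up the slash
'<>'   -match $pattern    # False: \w+? needs at least one word character

# {1,} lets a run of adjacent tags be consumed as a single match
if ('<body><p>text' -match $pattern) {
    $Matches[0]           # <body><p>
}
```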

Basically we’re looking for any opening or closing HTML tag. We’re not catching all HTML, though: tags that carry attributes or other values inside them (like <img src="pic.png" />) slip through. The regex in this example can easily be built upon to include cases like that, now that you’ve got this far. You could even just replace the \w with [^>], which means “any character except a closing angle bracket”.
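Here is a rough side-by-side of the two patterns on a tag with an attribute. The input string and the variable names ($sample, $wordOnly, $anyChar) are invented for this sketch.

```powershell
# Hypothetical input that includes a tag with an attribute
$sample = '<p><img src="pic.png" />This is an important value</p>'

# Original pattern: the img tag survives because of the space, quotes and slash inside it
$wordOnly = ($sample -replace '(<\/*\w+?>){1,}').Trim()

# Same pattern with \w swapped for [^>]: anything up to the next > counts as tag content
$anyChar = ($sample -replace '(<\/*[^>]+?>){1,}').Trim()

$wordOnly   # <img src="pic.png" />This is an important value
$anyChar    # This is an important value
```

The [^>] version will still trip over an attribute value that itself contains a >, which is one more reason to keep this trick for simple, predictable input.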

Happy regexing!

Written on February 21, 2018