To print an HTML string as plain text, we need to strip its tags. In today's post, we will examine three methods to convert HTML to plain text in Golang.
HTML is the standard markup language for Web pages.
Method 1: Using regex and the Replace function
We already have an article about how to use regex in Golang. This method is a quick and practical way to get rid of HTML tags: the ReplaceAllString function replaces every match of a tag pattern with the empty string. Its drawback is that HTML entities (such as &amp;) are left in the text undecoded, but it still works well for simple input.
func (re *Regexp) ReplaceAllString(src, repl string) string
ReplaceAllString returns a copy of src, replacing matches of the Regexp with the replacement string repl. Inside repl, $ signs are interpreted as in Expand, so for instance $1 represents the text of the first submatch.
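To make the $1 behaviour concrete, here is a minimal sketch that keeps a link's text and appends its URL instead of dropping the tag entirely. The function name linkToText and the pattern are assumptions made for illustration; the pattern only handles anchors whose first attribute is a double-quoted href, so treat it as a demonstration of submatch expansion rather than a general link converter.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkPattern captures the href value as submatch 1 and the link text as
// submatch 2. It is deliberately narrow: href must be the first attribute
// and must use double quotes.
var linkPattern = regexp.MustCompile(`<a\s+href="([^"]*)"[^>]*>(.*?)</a>`)

// linkToText rewrites each matching anchor as "text (url)".
// In the replacement string, $2 is the link text and $1 is the URL.
func linkToText(src string) string {
	return linkPattern.ReplaceAllString(src, "$2 ($1)")
}

func main() {
	src := `Visit <a href="https://go.dev">the Go site</a> today.`
	fmt.Println(linkToText(src))
}
```

If the replacement should contain a literal dollar sign, write it as $$, or use ReplaceAllLiteralString, which does no expansion at all.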
Here is an example of converting an HTML document to plain text by removing all the HTML tags:
package main
import (
	"fmt"
	"regexp"
)
func main() {
	// pattern that matches any HTML tag
	re := regexp.MustCompile(`<[^>]*>`)
	html := `<div><h1>GoLinuxCloud</h1>
<p>This is an html document!</p></div>`
	// replace every tag with the empty string
	strippedHtml := re.ReplaceAllString(html, "")
	fmt.Println("html:", html)
	fmt.Println("-----")
	fmt.Println("html to plain text:", strippedHtml)
}
Output:
html: <div><h1>GoLinuxCloud</h1>
<p>This is an html document!</p></div>
-----
html to plain text: GoLinuxCloud
This is an html document!
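Method 1 leaves HTML entities such as &amp; in the output. One possible way to handle them, sketched below, is to pass the stripped text through html.UnescapeString from the standard library's html package. The helper name stripAndUnescape is an assumption for illustration. The order matters: tags are removed first, so that a decoded &lt; is not mistaken for the start of a tag.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// tagPattern is the same tag-matching pattern used in Method 1.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// stripAndUnescape removes tags first and decodes entities second.
func stripAndUnescape(src string) string {
	noTags := tagPattern.ReplaceAllString(src, "")
	return html.UnescapeString(noTags)
}

func main() {
	src := `<p>Tom &amp; Jerry &lt;3</p>`
	fmt.Println(stripAndUnescape(src))
}
```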
Method 2: Parsing the HTML to a DOM tree
This approach is more robust than pattern matching because a real HTML parser interprets the markup for us. We parse the document and collect the content of its text nodes, which gives us the plain text.
The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an XML or HTML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document with a logical tree. Each branch of the tree ends in a node, and each node contains objects.
With the golang.org/x/net/html package, we can parse HTML with either of two APIs: the tokenizer API and the tree-based node parsing API.
Here is an example HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Pets</title>
</head>
<body>
<p>
A list of pets
</p>
<ul>
<li>dog</li>
<li>cat</li>
<li>bird</li>
<li>rabbit</li>
</ul>
<footer>
Go Linux Cloud page
</footer>
</body>
</html>
Here is the parse function, which reads the HTML string with the tokenizer API and collects the content of every text token:
func parse(text string) (data []string) {
	tkn := html.NewTokenizer(strings.NewReader(text))
	var vals []string
	for {
		tt := tkn.Next()
		switch {
		// ErrorToken also marks the end of the input
		case tt == html.ErrorToken:
			return vals
		// check if it is a text token
		case tt == html.TextToken:
			t := tkn.Token()
			vals = append(vals, t.Data)
		}
	}
}
We traverse all the text tokens, add their content to a slice, and then print the slice to the console:
package main
import (
	"fmt"
	"strings"
	"golang.org/x/net/html"
)
func main() {
	data := parse(text)
	fmt.Println(data)
}
func parse(text string) (data []string) {
	tkn := html.NewTokenizer(strings.NewReader(text))
	var vals []string
	for {
		tt := tkn.Next()
		switch {
		// ErrorToken also marks the end of the input
		case tt == html.ErrorToken:
			return vals
		// check if it is a text token
		case tt == html.TextToken:
			t := tkn.Token()
			vals = append(vals, t.Data)
		}
	}
}
var text = `<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Pets</title> </head> <body> <p> A list of pets </p> <ul> <li>dog</li> <li>cat</li> <li>bird</li> <li>rabbit</li> </ul> <footer> Go Linux Cloud page </footer> </body> </html>`
Output:
[ Pets A list of pets dog cat bird rabbit Go Linux Cloud page ]
Method 3: Using the html2text library
html2text is a simple Golang package that converts HTML to plain text (without non-standard dependencies).
It converts HTML tags to text and also parses HTML entities into the characters they represent. The <head> section of the HTML document, as well as most other tags, is stripped out, but links are converted into their href attribute.
Install
go get github.com/k3a/html2text
Usage
This is a simple example of using html2text to get the plain text from an HTML document:
package main
import (
	"fmt"
	"github.com/k3a/html2text"
)
var text = `<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Pets</title> </head> <body> <p> A list of pets </p> <ul> <li>dog</li> <li>cat</li> <li>bird</li> <li>rabbit</li> </ul> <footer> Go Linux Cloud page </footer> </body> </html>`
func main() {
	plain := html2text.HTML2Text(text)
	fmt.Println(plain)
}
Output:
A list of pets
dog
cat
bird
rabbit
Go Linux Cloud page
Summary
The examples above show how to convert an HTML string or file to plain text. Note that you should not rely on regex to parse HTML. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. The more reliable methods for converting HTML documents to plain text are the html2text library, or parsing the document with golang.org/x/net/html and traversing its text nodes.
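To illustrate the kind of edge case meant here, the following sketch feeds two inputs to the tag pattern from Method 1: one with a > inside an attribute value, and one with a < inside a script. The helper name stripTags and both inputs are made up for this demonstration. In each case the pattern cuts at the wrong place and leaves fragments of markup or code in the result.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern is the tag-matching pattern from Method 1.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// stripTags removes everything the pattern considers a tag.
func stripTags(src string) string {
	return tagPattern.ReplaceAllString(src, "")
}

func main() {
	inputs := []string{
		// The > inside the attribute value ends the match too early.
		`<p title="1 > 0">text</p>`,
		// The < inside the script is treated as the start of a tag.
		`<script>if (a < b) { run(); }</script>done`,
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, stripTags(in))
	}
}
```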
References
https://pkg.go.dev/regexp
https://en.wikipedia.org/wiki/Document_Object_Model
https://pkg.go.dev/golang.org/x/net/html