How to turn an HTML body tag into a string in Go? [SOLVED]

To print an HTML string as plain text, we need to strip its tags. In this post, we will examine three methods to convert HTML to plain text in Go.

HTML is the standard markup language for Web pages.

Method 1: Using regex and the Replace function

We already have an article about how to use regex in Golang. This method is a quick and practical way to get rid of HTML tags: every match of the tag pattern is replaced with the empty string by the ReplaceAllString function. Its limitation is that HTML entities (for example &amp;) are left in the text undecoded, because they are not tags. For simple, well-formed markup it still works well.

func (re *Regexp) ReplaceAllString(src, repl string) string: ReplaceAllString returns a copy of src, replacing matches of the Regexp with the replacement string repl. Inside repl, $ signs are interpreted as in Expand, so for instance $1 represents the text of the first submatch.
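As a small illustration of the $1 syntax described above, the following sketch keeps a link's URL next to its text instead of discarding it. The pattern and the helper name linkToText are assumptions made for this example, and the pattern only handles simple anchors with a double-quoted href attribute.

```go
package main

import (
    "fmt"
    "regexp"
)

// anchorRe captures the href value ($1) and the link text ($2)
// of a simple anchor tag. It is not a general HTML parser.
var anchorRe = regexp.MustCompile(`<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>`)

// linkToText (hypothetical helper) rewrites each matched anchor as "text (url)".
func linkToText(src string) string {
    return anchorRe.ReplaceAllString(src, "$2 ($1)")
}

func main() {
    fmt.Println(linkToText(`Visit <a href="https://example.com">our site</a> today.`))
    // prints: Visit our site (https://example.com) today.
}
```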

Here is an example of converting an HTML document to plain text by removing all the HTML tags:

go

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // the pattern for html tag
    re := regexp.MustCompile(`<[^>]*>`)
    html := `<div><h1>GoLinuxCloud</h1>
            <p>This is an html document!</p></div>`

    strippedHtml := re.ReplaceAllString(html, "")
    fmt.Println("html: ", html)
    fmt.Println("-----")
    fmt.Println("html to plain text:", strippedHtml)

}

Output:

html:  <div><h1>GoLinuxCloud</h1>
                        <p>This is an html document!</p></div>
-----
html to plain text: GoLinuxCloud
                        This is an html document!
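The entity limitation mentioned earlier can be worked around with the standard library. The sketch below strips the tags first and then decodes entities with html.UnescapeString; the helper name stripAndDecode is made up for this example. Note that this is the standard library's html package, not golang.org/x/net/html.

```go
package main

import (
    "fmt"
    "html"
    "regexp"
)

// tagRe is the same tag pattern used in the example above.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripAndDecode (hypothetical helper) removes the tags first and then
// decodes entities such as &amp; and &lt; into plain characters.
func stripAndDecode(src string) string {
    return html.UnescapeString(tagRe.ReplaceAllString(src, ""))
}

func main() {
    src := `<p>Fish &amp; chips &lt;3</p>`
    fmt.Println(tagRe.ReplaceAllString(src, "")) // Fish &amp; chips &lt;3
    fmt.Println(stripAndDecode(src))             // Fish & chips <3
}
```

Decoding after stripping matters: if the entities were decoded first, the resulting < character could be mistaken for the start of a tag.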

Method 2: Parsing the HTML to a DOM tree

This approach is more robust than matching tags with a regex. Instead of guessing where tags begin and end, we hand the document to an HTML parser and collect the content of its text nodes.

The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an XML or HTML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document with a logical tree. Each branch of the tree ends in a node, and each node contains objects.

The golang.org/x/net/html package offers two sets of APIs to parse HTML: the tokenizer API and the tree-based node parsing API. The example in this section uses the tokenizer API.

Here is an example HTML file:

xml

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Pets</title>
</head>
<body>

<p>
    A list of pets
</p>

<ul>
    <li>dog</li>
    <li>cat</li>
    <li>bird</li>
    <li>rabbit</li>
</ul>

<footer>
    Go Linux Cloud page
</footer>

</body>
</html>

Here is the parse function, which reads the HTML string with the tokenizer and collects the data of every text token:

go

// parse returns the data of every text token found in the HTML string.
func parse(text string) []string {
    tkn := html.NewTokenizer(strings.NewReader(text))
    var vals []string

    for {
        tt := tkn.Next()
        switch tt {
        case html.ErrorToken:
            // ErrorToken is also returned at the end of the input (io.EOF)
            return vals
        case html.TextToken:
            // collect the text between tags
            vals = append(vals, tkn.Token().Data)
        }
    }
}

We will traverse all the text tokens, add their text to a slice, and then print the slice to the console:

go

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    data := parse(text)
    fmt.Println(data)

}

// parse returns the data of every text token found in the HTML string.
func parse(text string) []string {
    tkn := html.NewTokenizer(strings.NewReader(text))
    var vals []string

    for {
        tt := tkn.Next()
        switch tt {
        case html.ErrorToken:
            // ErrorToken is also returned at the end of the input (io.EOF)
            return vals
        case html.TextToken:
            // collect the text between tags
            vals = append(vals, tkn.Token().Data)
        }
    }
}

var text = `<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Pets</title> </head> <body> <p> A list of pets </p> <ul> <li>dog</li> <li>cat</li> <li>bird</li> <li>rabbit</li> </ul> <footer> Go Linux Cloud page </footer> </body> </html>`

Output:

[        Pets        A list of pets      dog   cat   bird   rabbit      Go Linux Cloud page     ]

Method 3: Using the html2text library

html2text is a simple Golang package that converts HTML to plain text (without non-standard dependencies).
It converts HTML tags to text and also decodes HTML entities into the characters they represent. The <head> section of the HTML document, as well as most other tags, is stripped out, but links are properly converted into their href attribute.

Install

bash

go get github.com/k3a/html2text

Usage

This is a simple example of using html2text to get the plain text from an HTML document:

go

package main

import (
    "fmt"

    "github.com/k3a/html2text"
)

var text = `<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Pets</title> </head> <body> <p> A list of pets </p> <ul> <li>dog</li> <li>cat</li> <li>bird</li> <li>rabbit</li> </ul> <footer> Go Linux Cloud page </footer> </body> </html>`

func main() {
    plain := html2text.HTML2Text(text)
    fmt.Println(plain)

}

Output:

A list of pets 

dog
cat
bird
rabbit
 Go Linux Cloud page

Summary

The examples above show how to convert an HTML string or file to plain text. Note that you should not rely on regex to parse HTML. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. The more reliable ways to convert HTML documents to plain text are to use the html2text package or to parse the document with golang.org/x/net/html and collect its text nodes.
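To make the edge-case warning concrete, here is a short sketch that runs the tag pattern from Method 1 on two inputs it cannot handle. The helper name stripTags and both inputs are chosen for this illustration only.

```go
package main

import (
    "fmt"
    "regexp"
)

// tagRe is the tag pattern from Method 1.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripTags (hypothetical helper) removes everything the pattern matches.
func stripTags(src string) string {
    return tagRe.ReplaceAllString(src, "")
}

func main() {
    // A ">" inside a quoted attribute ends the match too early,
    // so part of the attribute leaks into the output.
    fmt.Printf("%q\n", stripTags(`<a title="1 > 0">link</a>`))
    // prints: " 0\">link"

    // The contents of <script> are not tags, so they survive as "text".
    fmt.Printf("%q\n", stripTags(`<script>alert("hi")</script>Hello`))
    // prints: "alert(\"hi\")Hello"
}
```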