charsetx detects the charset encoding of an HTML document and converts a non-UTF-8 page body to a UTF-8 string.
There are 3 steps for charset detection:
- Return the result of charset.DetermineEncoding() if certain is true.
- Return the result of chardet.Detector.DetectBest() if Confidence is 100.
- Return the charset in the Content-Type meta tag if it exists.
If all 3 steps fail, it returns an error.
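The fallback order above can be sketched as a chain of detectors tried in turn. This is a minimal illustration using only the standard library, not the charsetx implementation: the detector type and the detect, unsure, and fromContentType names are hypothetical, the first two steps are replaced by stubs because they depend on external packages, and the last step is approximated by parsing a Content-Type value with mime.ParseMediaType.

```go
package main

import (
	"errors"
	"fmt"
	"mime"
	"strings"
)

// detector is a hypothetical signature for one detection step: it reports a
// charset name and whether the step is confident enough to be trusted.
type detector func(body []byte, contentType string) (name string, ok bool)

// fromContentType extracts the charset parameter from a Content-Type value
// such as "text/html; charset=euc-kr".
func fromContentType(_ []byte, contentType string) (string, bool) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", false
	}
	cs := strings.ToLower(params["charset"])
	return cs, cs != ""
}

// unsure is a stub standing in for steps 1 and 2, which need external packages.
func unsure(_ []byte, _ string) (string, bool) { return "", false }

// detect runs the steps in order and returns the first confident answer.
// If no step is confident, it returns an error.
func detect(body []byte, contentType string, steps ...detector) (string, error) {
	for _, step := range steps {
		if name, ok := step(body, contentType); ok {
			return name, nil
		}
	}
	return "", errors.New("charset not detected")
}

func main() {
	body := []byte("<html></html>")
	cs, err := detect(body, "text/html; charset=EUC-KR", unsure, unsure, fromContentType)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(cs)
}
```

Keeping each step behind the same function signature makes the order easy to change and lets a single loop produce the final error when every step is inconclusive.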
go get -u github.com/philipjkim/charsetx
// If the second bool param is true, invalid UTF-8 characters are discarded
// so that a result is returned instead of an error.
r, err := charsetx.GetUTF8BodyFromURL("http://www.godoc.org", false)
if err != nil {
	fmt.Println(err)
	return
}
fmt.Println(r)
If you want to reuse or customize *http.Client instead of http.DefaultClient, use GetUTF8Body().
client := http.DefaultClient
resp, err := client.Get("http://www.godoc.org")
if err != nil {
	fmt.Println(err)
	return
}
defer resp.Body.Close()
// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
byt, err := io.ReadAll(resp.Body)
if err != nil {
	fmt.Println(err)
	return
}
// Detect the charset from the body bytes and the Content-Type header.
cs, err := charsetx.DetectCharset(byt, resp.Header.Get("Content-Type"))
if err != nil {
	fmt.Println(err)
	return
}
fmt.Println(cs)