charsetx

charsetx detects the charset encoding of an HTML document and converts a non-UTF-8 page body to a UTF-8 string.

Charset detection tries three steps in order:

  1. Return the result of charset.DetermineEncoding() if certain is true.
  2. Otherwise, return the result of chardet.Detector.DetectBest() if Confidence is 100.
  3. Otherwise, return the charset in the Content-Type meta tag, if one exists.

If all three steps fail, an error is returned.
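
The fallback order above can be sketched as a chain of detector functions, where the first confident answer wins. This is a minimal standard-library sketch, not charsetx's implementation: the `detector` type, `detect`, `fromHeader`, and `fromMeta` are hypothetical names, and the two steps shown are simplified stand-ins for the calls to charset.DetermineEncoding() and chardet.Detector.DetectBest().

```go
package main

import (
	"errors"
	"fmt"
	"mime"
	"regexp"
	"strings"
)

// detector is a hypothetical step signature: it returns a charset name and
// whether the step is confident enough for its answer to be used.
type detector func(body []byte, contentType string) (name string, ok bool)

var errNoCharset = errors.New("charset not detected")

// metaCharsetRe is a simplified pattern that finds charset=... inside a
// <meta> tag. A real HTML parser is more robust than a regular expression.
var metaCharsetRe = regexp.MustCompile(`(?i)<meta[^>]+charset\s*=\s*["']?([\w-]+)`)

// fromHeader reads the charset parameter of a Content-Type header value.
func fromHeader(_ []byte, contentType string) (string, bool) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", false
	}
	cs := params["charset"]
	return strings.ToLower(cs), cs != ""
}

// fromMeta reads the charset declared in a <meta> tag of the body.
func fromMeta(body []byte, _ string) (string, bool) {
	m := metaCharsetRe.FindSubmatch(body)
	if m == nil {
		return "", false
	}
	return strings.ToLower(string(m[1])), true
}

// detect runs the steps in order and returns the first confident answer,
// or an error when every step gives up.
func detect(body []byte, contentType string, steps ...detector) (string, error) {
	for _, step := range steps {
		if name, ok := step(body, contentType); ok {
			return name, nil
		}
	}
	return "", errNoCharset
}

func main() {
	body := []byte(`<html><head><meta http-equiv="Content-Type" content="text/html; charset=euc-kr"></head></html>`)
	// The header carries no charset parameter, so the meta tag step decides.
	cs, err := detect(body, "text/html", fromHeader, fromMeta)
	fmt.Println(cs, err)
}
```

Modelling each step as a function with a confidence flag keeps the ordering explicit and makes it easy to add or reorder steps.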

Install

go get -u github.com/philipjkim/charsetx

Example

Getting the UTF-8 string of the body for a given URL

// If the second bool parameter is set to true, invalid UTF-8
// characters are discarded so that a result is returned
// instead of an error.
r, err := charsetx.GetUTF8BodyFromURL("http://www.godoc.org", false)
if err != nil {
    fmt.Println(err)
    return
}

fmt.Println(r)
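
To illustrate what discarding invalid UTF-8 characters means in practice, here is a small standard-library sketch. It is not taken from charsetx: `toValid` is a hypothetical helper, and its `discardInvalid` flag only mimics the role of the second bool parameter described above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// toValid is a hypothetical helper. It returns s unchanged when s is valid
// UTF-8. Otherwise it either drops the invalid bytes (discardInvalid == true)
// or reports an error (discardInvalid == false).
func toValid(s string, discardInvalid bool) (string, error) {
	if utf8.ValidString(s) {
		return s, nil
	}
	if !discardInvalid {
		return "", errors.New("body contains invalid UTF-8")
	}
	// An empty replacement string removes each run of invalid bytes.
	return strings.ToValidUTF8(s, ""), nil
}

func main() {
	broken := "ab\xffcd" // 0xff can never appear in valid UTF-8

	_, err := toValid(broken, false)
	fmt.Println(err != nil) // true: strict mode reports the problem

	s, _ := toValid(broken, true)
	fmt.Println(s) // abcd
}
```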

If you want to reuse or customize an *http.Client instead of using http.DefaultClient, use GetUTF8Body().

Getting the charset of a given URL

client := http.DefaultClient
resp, err := client.Get("http://www.godoc.org")
if err != nil {
    fmt.Println(err)
    return
}

defer resp.Body.Close()
byt, err := io.ReadAll(resp.Body) // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+)
if err != nil {
    fmt.Println(err)
    return
}

cs, err := charsetx.DetectCharset(byt, resp.Header.Get("Content-Type"))
if err != nil {
    fmt.Println(err)
    return
}

fmt.Println(cs)

Related Projects

License

MIT
