Hi! The new version works great and is a pleasure to use!
It seems that on https://kajsotala.fi, blog2epub finds 95 posts, of which the last 3 or so appear to be PDFs, and then it tries to download them. Is that intentional?
I'm guessing that's why I get the following ValueError on post 94 or so. I worked around it by only downloading the first 92 posts :)
[INFO ] 93. None
[DEBUG ] [http ]//kajsotala.fi:80 "GET /Papers/DigitalAdvantages.pdf HTTP/11" 301 167
[DEBUG ] [https ]//kajsotala.fi:443 "GET /Papers/DigitalAdvantages.pdf HTTP/11" 404 None
[INFO ] Downloading
[Level 5 ] ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
[Level 5 ] ascii should target any language(s) of ['Latin Based']
[DEBUG ] [Encoding detection] ascii is most likely the one.
[INFO ] 94. None
[DEBUG ] [http ]//kajsotala.fi:80 "GET /Papers/CoalescingMinds.pdf HTTP/11" 301 167
[DEBUG ] [https ]//kajsotala.fi:443 "GET /Papers/CoalescingMinds.pdf HTTP/11" 301 None
[DEBUG ] [https ]//stuff.kajsotala.fi:443 "GET /Papers/CoalescingMinds.pdf HTTP/11" 200 243381
[Level 5 ] Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xd0 in position 10: ordinal not in range(128)
[Level 5 ] Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xd0 in position 10: invalid continuation byte
[Level 5 ] Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] cp037 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 581.150000 %.
[Level 5 ] cp1006 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 77.600000 %.
[Level 5 ] cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
[Level 5 ] cp1125 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 86.450000 %.
[Level 5 ] cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
[Level 5 ] Code page cp1250 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 116: character maps to <undefined>
[Level 5 ] Code page cp1251 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 116: character maps to <undefined>
[Level 5 ] Code page cp1252 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 309: character maps to <undefined>
[Level 5 ] Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xaa in position 90: character maps to <undefined>
[Level 5 ] Code page cp1254 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x9e in position 103: character maps to <undefined>
[Level 5 ] Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xda in position 77: character maps to <undefined>
[Level 5 ] cp1256 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 47.100000 %.
[Level 5 ] Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa5 in position 78: character maps to <undefined>
[Level 5 ] Code page cp1258 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x9e in position 103: character maps to <undefined>
[Level 5 ] cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
[Level 5 ] Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x72 in position 51: character maps to <undefined>
[Level 5 ] cp437 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 69.950000 %.
[Level 5 ] cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
[Level 5 ] cp720 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 90.050000 %.
[Level 5 ] cp737 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 88.850000 %.
[Level 5 ] cp775 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 76.700000 %.
[Level 5 ] cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp852 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 87.800000 %.
[Level 5 ] cp855 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 70.400000 %.
[Level 5 ] Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd0 in position 10: character maps to <undefined>
[Level 5 ] Code page cp857 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xf2 in position 84: character maps to <undefined>
[Level 5 ] cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] Code page cp864 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa7 in position 88: character maps to <undefined>
[Level 5 ] cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
[Level 5 ] Code page cp869 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x87 in position 117: character maps to <undefined>
[Level 5 ] Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 82: character maps to <undefined>
[Level 5 ] cp875 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 347.250000 %.
[Level 5 ] Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xec in position 120: illegal multibyte sequence
[Level 5 ] Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] Code page hp_roman8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xff in position 372: character maps to <undefined>
[Level 5 ] Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] iso8859_10 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 61.050000 %.
[Level 5 ] Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 82: character maps to <undefined>
[Level 5 ] iso8859_13 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 59.200000 %.
[Level 5 ] iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] iso8859_15 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] iso8859_16 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 59.150000 %.
[Level 5 ] iso8859_2 is deemed too similar to code page iso8859_16 and was consider unsuited already. Continuing!
[Level 5 ] Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd0 in position 10: character maps to <undefined>
[Level 5 ] iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] iso8859_5 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 64.950000 %.
[Level 5 ] Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa5 in position 78: character maps to <undefined>
[Level 5 ] Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd2 in position 193: character maps to <undefined>
[Level 5 ] Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd0 in position 10: character maps to <undefined>
[Level 5 ] iso8859_9 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xda in position 77: illegal multibyte sequence
[Level 5 ] koi8_r was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 79.450000 %.
[Level 5 ] Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xaa in position 90: character maps to <undefined>
[Level 5 ] koi8_u was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 60.100000 %.
[Level 5 ] Code page kz1048 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 116: character maps to <undefined>
[Level 5 ] latin_1 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] mac_cyrillic was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 41.250000 %.
[Level 5 ] mac_greek was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 46.150000 %.
[Level 5 ] mac_iceland was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 95.500000 %.
[Level 5 ] mac_latin2 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 63.900000 %.
[Level 5 ] mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
[Level 5 ] mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
[Level 5 ] ptcp154 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 35.300000 %.
[Level 5 ] Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xf2 in position 84: illegal multibyte sequence
[Level 5 ] Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xa0 in position 112: illegal multibyte sequence
[Level 5 ] Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xa0 in position 112: illegal multibyte sequence
[Level 5 ] Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 82: character maps to <undefined>
[Level 5 ] Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
[Level 5 ] Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 80-81: illegal UTF-16 surrogate
[Level 5 ] Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 12-13: illegal UTF-16 surrogate
[Level 5 ] Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
[Level 5 ] Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
[Level 5 ] Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
[Level 5 ] Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
[DEBUG ] [Encoding detection] Unable to determine any suitable charset.
[Level 5 ] Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xd0 in position 10: ordinal not in range(128)
[Level 5 ] Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xd0 in position 10: invalid continuation byte
[Level 5 ] Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] cp037 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 581.150000 %.
[Level 5 ] cp1006 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 77.600000 %.
[Level 5 ] cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
[Level 5 ] cp1125 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 86.450000 %.
[Level 5 ] cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
[Level 5 ] Code page cp1250 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 116: character maps to <undefined>
[Level 5 ] Code page cp1251 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 116: character maps to <undefined>
[Level 5 ] Code page cp1252 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 309: character maps to <undefined>
[Level 5 ] Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xaa in position 90: character maps to <undefined>
[Level 5 ] Code page cp1254 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x9e in position 103: character maps to <undefined>
[Level 5 ] Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xda in position 77: character maps to <undefined>
[Level 5 ] cp1256 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 47.100000 %.
[Level 5 ] Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa5 in position 78: character maps to <undefined>
[Level 5 ] Code page cp1258 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x9e in position 103: character maps to <undefined>
[Level 5 ] cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
[Level 5 ] Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x72 in position 51: character maps to <undefined>
[Level 5 ] cp437 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 69.950000 %.
[Level 5 ] cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
[Level 5 ] cp720 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 90.050000 %.
[Level 5 ] cp737 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 88.850000 %.
[Level 5 ] cp775 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 76.700000 %.
[Level 5 ] cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp852 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 87.800000 %.
[Level 5 ] cp855 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 70.400000 %.
[Level 5 ] Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd0 in position 10: character maps to <undefined>
[Level 5 ] Code page cp857 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xf2 in position 84: character maps to <undefined>
[Level 5 ] cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] Code page cp864 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa7 in position 88: character maps to <undefined>
[Level 5 ] cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
[Level 5 ] cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
[Level 5 ] Code page cp869 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x87 in position 117: character maps to <undefined>
[Level 5 ] Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 82: character maps to <undefined>
[Level 5 ] cp875 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 347.250000 %.
[Level 5 ] Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xec in position 120: illegal multibyte sequence
[Level 5 ] Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xd9 in position 80: illegal multibyte sequence
[Level 5 ] Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xdb in position 82: illegal multibyte sequence
[Level 5 ] Code page hp_roman8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xff in position 372: character maps to <undefined>
[Level 5 ] Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xd0 in position 10: illegal multibyte sequence
[Level 5 ] iso8859_10 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 61.050000 %.
[Level 5 ] Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 82: character maps to <undefined>
[Level 5 ] iso8859_13 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 59.200000 %.
[Level 5 ] iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] iso8859_15 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] iso8859_16 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 59.150000 %.
[Level 5 ] iso8859_2 is deemed too similar to code page iso8859_16 and was consider unsuited already. Continuing!
[Level 5 ] Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd0 in position 10: character maps to <undefined>
[Level 5 ] iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] iso8859_5 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 64.950000 %.
[Level 5 ] Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa5 in position 78: character maps to <undefined>
[Level 5 ] Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd2 in position 193: character maps to <undefined>
[Level 5 ] Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd0 in position 10: character maps to <undefined>
[Level 5 ] iso8859_9 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xda in position 77: illegal multibyte sequence
[Level 5 ] koi8_r was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 79.450000 %.
[Level 5 ] Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xaa in position 90: character maps to <undefined>
[Level 5 ] koi8_u was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 60.100000 %.
[Level 5 ] Code page kz1048 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 116: character maps to <undefined>
[Level 5 ] latin_1 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
[Level 5 ] mac_cyrillic was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 41.250000 %.
[Level 5 ] mac_greek was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 46.150000 %.
[Level 5 ] mac_iceland was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 95.500000 %.
[Level 5 ] mac_latin2 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 63.900000 %.
[Level 5 ] mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
[Level 5 ] mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
[Level 5 ] ptcp154 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 35.300000 %.
[Level 5 ] Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xf2 in position 84: illegal multibyte sequence
[Level 5 ] Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xa0 in position 112: illegal multibyte sequence
[Level 5 ] Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xa0 in position 112: illegal multibyte sequence
[Level 5 ] Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 82: character maps to <undefined>
[Level 5 ] Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
[Level 5 ] Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 80-81: illegal UTF-16 surrogate
[Level 5 ] Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 12-13: illegal UTF-16 surrogate
[Level 5 ] Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
[Level 5 ] Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
[Level 5 ] Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
[Level 5 ] Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
[DEBUG ] [Encoding detection] Unable to determine any suitable charset.
[WARNING] Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Exception in thread Thread-1 (_download_ebook):
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/tmp/blog2epub/blog2epub/blog2epub_gui.py", line 440, in _download_ebook
blog2epub.download()
File "/tmp/blog2epub/blog2epub/blog2epub_main.py", line 55, in download
self.crawler.crawl()
File "/tmp/blog2epub/blog2epub/crawlers/default.py", line 369, in crawl
art = art_factory.process()
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/blog2epub/blog2epub/crawlers/article_factory/default.py", line 238, in process
self.tree = fromstring(self.html)
^^^^^^^^^^^^^^^^^^^^^
File "/home/me/.cache/pypoetry/virtualenvs/blog2epub-tOU-dq0N-py3.12/lib/python3.12/site-packages/lxml/html/soupparser.py", line 33, in fromstring
return _parse(data, beautifulsoup, makeelement, **bsargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/.cache/pypoetry/virtualenvs/blog2epub-tOU-dq0N-py3.12/lib/python3.12/site-packages/lxml/html/soupparser.py", line 79, in _parse
root = _convert_tree(tree, makeelement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/.cache/pypoetry/virtualenvs/blog2epub-tOU-dq0N-py3.12/lib/python3.12/site-packages/lxml/html/soupparser.py", line 152, in _convert_tree
res_root = convert_node(html_root)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/.cache/pypoetry/virtualenvs/blog2epub-tOU-dq0N-py3.12/lib/python3.12/site-packages/lxml/html/soupparser.py", line 216, in convert_node
return handler(bs_node, parent)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/.cache/pypoetry/virtualenvs/blog2epub-tOU-dq0N-py3.12/lib/python3.12/site-packages/lxml/html/soupparser.py", line 255, in convert_tag
handler(child, res)
File "/home/me/.cache/pypoetry/virtualenvs/blog2epub-tOU-dq0N-py3.12/lib/python3.12/site-packages/lxml/html/soupparser.py", line 242, in convert_tag
res = etree.SubElement(parent, bs_node.name, attrib=attribs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "src/lxml/etree.pyx", line 3205, in lxml.etree.SubElement
File "src/lxml/apihelpers.pxi", line 180, in lxml.etree._makeSubElement
File "src/lxml/apihelpers.pxi", line 1654, in lxml.etree._getNsTag
File "src/lxml/apihelpers.pxi", line 1672, in lxml.etree.__getNsTag
File "src/lxml/apihelpers.pxi", line 1530, in lxml.etree._utf8
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
@meedstrom thanks a lot for reporting, this is a very interesting case. I didn't expect that the sitemap would contain links to files other than HTML/XML. I've uploaded a fix for that. It will be in 1.5.0 RC2 by the end of the week, or maybe over the weekend (I found another bug regarding PyInstaller and a couple on Android, and I also want to try to deploy the package on PyPI).
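For anyone hitting the same thing before the release, here is a rough sketch of the kind of check that would skip such sitemap entries. It is not the actual blog2epub fix, just an illustration under some assumptions: the function names, the extension deny-list, and the accepted media types are all made up for this example. The idea is to reject URLs with an obviously binary extension up front, and to double-check the downloaded body (PDF files start with `%PDF-`) before handing it to the HTML parser.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical deny-list for illustration; not taken from blog2epub.
NON_HTML_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip", ".mp3"}

# Media types assumed to be safe to pass to an HTML parser.
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def looks_like_html_url(url: str) -> bool:
    """Return False when the URL path ends in a known non-HTML extension."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in NON_HTML_EXTENSIONS


def looks_like_html_response(content_type: Optional[str], body: bytes) -> bool:
    """Return False for bodies that are clearly not HTML.

    A PDF is detected by its magic bytes even if the server sends a
    misleading Content-Type header. A missing header is given the
    benefit of the doubt.
    """
    if body.lstrip()[:5] == b"%PDF-":
        return False
    if content_type is None:
        return True
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


urls = [
    "https://kajsotala.fi/some-post/",
    "http://kajsotala.fi/Papers/CoalescingMinds.pdf",
]
# Only the first URL survives the filter.
posts = [u for u in urls if looks_like_html_url(u)]
```

A deny-list is used here instead of an allow-list so that post slugs containing a dot are not dropped by accident. The response check is the more reliable of the two, since a redirect can land on a binary file whatever the original URL looked like.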