Apologies for the top-post.
Look into the NKF library that comes with Ruby 1.8.5 (later
patchlevels). It has some awesome mixins into the String class that
can help you normalize your strings to UTF-8. It will also set $KCODE
appropriately, which is the variable that controls your default
string encoding. This is vital. Ruby is also extremely tolerant of
malformed UTF-8, something that we can probably thank Tim Bray for.
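For what it's worth, here is a minimal sketch of the kind of NKF
normalization I mean. It assumes the input is known to be Shift_JIS
(the -S flag) and asks for UTF-8 output (-w). The to_utf8 helper and
the sample bytes are made up for illustration. On Ruby 1.9 and later
$KCODE no longer applies, and nkf may have to be installed as a gem.

```ruby
require 'nkf'

# Hypothetical helper: convert Shift_JIS bytes to UTF-8 using NKF.
# -S declares the input as Shift_JIS, -w requests UTF-8 output.
def to_utf8(sjis_bytes)
  NKF.nkf('-S -w', sjis_bytes)
end

sjis = "\x82\xA0".b # hiragana "a" encoded as Shift_JIS (sample data)
utf8 = to_utf8(sjis)

# Print the converted bytes in hex so the result is easy to inspect.
puts utf8.bytes.map { |b| format('%02X', b) }.join(' ')
```

If you don't know the source encoding, leaving out -S lets NKF guess,
which is less reliable on short strings.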
Notes on Hpricot that may or may not still be relevant (I'm using
it for a rather large project involving UTF-8 encoded HTML, and may
be using an old version, 0.5.x):
- Hpricot will preserve formatting that is not XML compliant. Be
aware of this: normalize ahead of time if necessary, then use the
Hpricot::XML constructor. Libtidy does a decent job of normalizing.
- Using Hpricot's built-in (non-Ruby) character set support is a
good way to get nothing back.
- Passing any arguments to Hpricot's constructor (other than the
content) is a good way to get malformed output back.
Really, at this point, if you need something robust and well-tested,
LibXML2 is probably a better choice and has a DOM-compliant
interface, but I don't believe the Ruby support is that great. It's
worth a look; if it had been an option when I started this project
I'd have been all over it. If you're an API connoisseur, Hpricot is
slightly better.
Post by Eric Wilhelm
# from Javan Makhmali
like ‘ in the title
and description are being mangled with strange multibyte characters
That would be utf8.
Does anyone know why this happens
An xml parser such as expat will output utf8 instead of named
character entities for all characters which are not "<"=&lt; and
"&"=&amp;. That might be configurable, but it is often dictated by
the xml input. I'm not sure exactly what is under the hood of ruby's
standard rss parser, but it might well be expat.
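A quick way to see this behaviour is the sketch below. It uses REXML
as a stand-in parser (an assumption; it is not expat, and not
necessarily what the rss library uses), and the feed fragment is
invented for the example.

```ruby
require 'rexml/document'

# Hypothetical feed fragment with numeric character references
# for curly quotes, plus an escaped ampersand.
xml = '<item><title>&#8216;Hello&#8217; &amp; goodbye</title></item>'

doc = REXML::Document.new(xml)
title = doc.root.elements['title'].text

puts title          # the references come back as real characters
puts title.length   # number of characters
puts title.bytesize # larger, because the curly quotes are multibyte utf8
```

Anything downstream that treats that string as single-byte text will
show the curly quotes as several strange characters.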
and how I might fix / work around it?
The best way to *properly* deal with it is to treat it as characters
and not bytes, though that means your database layer, string objects,
and output layer all need to understand characters to some extent (of
course, low-byte ascii is a subset of utf8, so you could just flag
anything loaded from bag-o-bytes storage as characters and generally
be on your merry way.) If you're outputting to a browser, the
declared charset (Content-Type header or meta tag) should be utf8,
etc, etc.
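As a rough sketch of "flag the bytes as characters": this assumes a
Ruby with per-string encodings (1.9 or later, so not 1.8.5), and the
sample bytes are made up.

```ruby
# Bytes as they might come back from storage that knows nothing
# about encodings (sample data: curly quotes around a word).
raw = "\xE2\x80\x98quoted\xE2\x80\x99".b

# Flag the same bytes as utf8 characters; no bytes are changed.
text = raw.dup.force_encoding(Encoding::UTF_8)

# Refuse to go on if the bytes are not actually valid utf8.
raise ArgumentError, 'not valid utf8' unless text.valid_encoding?

puts raw.length  # counts bytes
puts text.length # counts characters
```

The validity check matters because flagging does no conversion; it
only changes how the bytes are interpreted.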
The improper way to deal with it is to strip them, though that can be
difficult to do on the encoded end if all you have is bytes (you
basically have to implement utf8 yourself :-) Alternatively, you could
s/&[^;]+;/thbbt/g on the front-end or other similarly hackish
workarounds.
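For completeness, the two hackish options might look like this in
Ruby. The helper names are hypothetical, and the second one assumes
the string is already flagged as utf8.

```ruby
# Replace anything that looks like an entity, mirroring the
# s/&[^;]+;/thbbt/g substitution.
def replace_entities(html)
  html.gsub(/&[^;]+;/, 'thbbt')
end

# Drop every non-ascii character from a string flagged as utf8.
def strip_non_ascii(text)
  text.gsub(/[^\x00-\x7F]/, '')
end

puts replace_entities('&#8216;Hi&#8217; there')
puts strip_non_ascii("\u2018Hi\u2019 there")
```

Note that the entity pattern will also eat a bare ampersand and
everything up to the next semicolon, which is part of why it is a
hack.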
Have fun.
--Eric
_______________________________________________
PDXRuby mailing list
IRC: #pdx.rb on irc.freenode.net
http://lists.pdxruby.org/mailman/listinfo/pdxruby
--
Erik Hollensbe
***@hollensbe.org