Irssi core bugs

  • Status Unconfirmed
  • Percent Complete 0%
  • Task Type Bug Report
  • Category core
  • Assigned To No-one
  • Operating System All
  • Severity Medium
  • Priority Normal
  • Reported Version Irssi SVN
  • Due in Version Undecided
  • Due Date Undecided
  • Votes 0
  • Private No
Attached to Project: Irssi core bugs
Opened by Sebastian Schmidt (yath) - 2009-08-12

FS#696 - Set the UTF8 flag on SvPVs when passing them to a perl handler

Hi,

when a perl signal handler is called by irssi and a SvPV is passed, the UTF8 flag isn't set, although irssi treats the data as utf8 in the recode code. This breaks, for example, length() in a command handler for "foo" when "/foo bär" is called: it incorrectly reports 4 (the byte count) instead of 3 (the character count).
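
To illustrate the 4-versus-3 discrepancy outside of perl, here is a minimal C sketch. It assumes the argument "bär" arrives as the UTF-8 bytes 62 C3 A4 72; utf8_char_count is a hypothetical helper written for this example and is not part of irssi or the attached patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a string that is assumed to be
 * valid UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;    /* ASCII byte or lead byte: starts a new character */
    }
    return n;
}
```

For the byte string "b\xc3\xa4r", strlen() sees four bytes while utf8_char_count() sees three characters. Without the UTF8 flag, perl's length() behaves like the former.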

The individual changes I made can be seen on http://github.com/yath/irssi/commits/utf8args, but a patch that applies to both current svn and (at least) 0.8.14 is attached.

Sebastian

This task does not depend on any other tasks.

Wouter Coekaerts (coekie)
Thursday, 13 August 2009, 10:20 GMT
I've had serious problems regarding the UTF-8 flag in perl scripts too, while making Webssi. My ugly workaround is calling utf8::decode in my script on every string received from Irssi that is about to be sent to the outside.

This affects more than just parameters in signals; it also affects all the strings in the objects/hashes irssi creates, like Server and Channel.

I'm not sure if assuming that
* every string that is valid in UTF-8 (and doesn't contain \e) is actually UTF-8
* everything that does contain \e isn't UTF-8
is really the best for all perl scripts.
Note that recode only makes the first assumption (if it's not auto-detected, it uses the configured charsets), and only if recode_autodetect_utf8 is on.

But this patch is a step in the right direction, thanks. It's already a lot better than what it currently does: treat nothing as UTF-8.
Sebastian Schmidt (yath)
Thursday, 13 August 2009, 14:37 GMT
Hi Wouter,

> I'm not sure if assuming that
> * every string that is valid in UTF-8 (and doesn't contain \e) is actually UTF-8
I've changed my patch so it checks whether recode_autodetect_utf8 is set.

> * everything that does contain \e isn't UTF-8
That's not quite right. Everything that is ASCII but doesn't contain \e is treated as UTF8. Normally it wouldn't make any difference to treat every ASCII string as UTF8, but that breaks ISO-2022 (see bug#392).

The behaviour of str_is_utf8 is exactly the same as in the old recode code (as I just copy&pasted it :-P)
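
For readers without the patch at hand, the rule discussed above can be sketched roughly as follows. This is not the actual str_is_utf8 from the patch or from irssi's recode code. looks_like_utf8 is a hypothetical name, and the sketch assumes that any ESC byte disqualifies a string (to leave ISO-2022 data alone) and that everything else is accepted if it is well-formed UTF-8, which includes plain ASCII.

```c
#include <stddef.h>

/* Hypothetical sketch of the heuristic; returns 1 if the string may be
 * flagged as UTF-8, 0 otherwise. */
int looks_like_utf8(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;

    while (*p != '\0') {
        unsigned int cp, min;
        int extra;  /* number of continuation bytes expected */

        if (*p == 0x1B)
            return 0;  /* ESC: could be an ISO-2022 escape sequence */
        if (*p < 0x80) {
            p++;       /* plain ASCII */
            continue;
        }
        if ((*p & 0xE0) == 0xC0) {
            extra = 1; cp = *p & 0x1F; min = 0x80;
        } else if ((*p & 0xF0) == 0xE0) {
            extra = 2; cp = *p & 0x0F; min = 0x800;
        } else if ((*p & 0xF8) == 0xF0) {
            extra = 3; cp = *p & 0x07; min = 0x10000;
        } else {
            return 0;  /* stray continuation byte or invalid lead byte */
        }
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return 0;  /* malformed or truncated (NUL ends up here too) */
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;  /* overlong form, out of range, or surrogate */
    }
    return 1;
}
```

Under these assumptions "foo" and UTF-8 "bär" are accepted, while Latin-1 "b\xe4r" and an ISO-2022 sequence such as "\x1b$B" are rejected.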

I've extended my patch to add the PvUTF8_on to Irssi::settings_add_str (that was the first thing I came across; I use "˙" as delimiter for splitlong.pl). If you know any other places where strings could probably be UTF8, please let me know. I suspect one could add PvUTF8_on directly to new_pv(), but I'm not quite sure if that doesn't break other things (as it's used in almost every place where irssi creates a PV).

Sebastian
Sebastian Schmidt (yath)
Thursday, 13 August 2009, 14:37 GMT
Oops, forgot the patch.
Wouter Coekaerts (coekie)
Thursday, 13 August 2009, 15:20 GMT
I haven't tested any of your code yet, but one thing I'm worried about (and will test):
Only setting this flag in signals, and not in the objects might actually break things. For example, take a channel with UTF-8 things in the name. In the "message public" signal, it will be flagged as utf-8, but in "channel created" that gives you a CHANNEL_REC, the channel name will not be flagged as utf-8. Could this make scripts believe it is a different channel?
Sebastian Schmidt (yath)
Thursday, 13 August 2009, 15:39 GMT
Hm, I think you are right.

Currently, three options come to mind:
1.) Only flag "text" strings as UTF8, i.e. treat channel names and so on still as binary data
2.) Set the UTF8 flag on *all* perl strings, if the string looks like UTF8
3.) Rewrite irssi to store somewhere in the string if it's UTF8 (maybe use a struct with flags or so) and pass that flag to the PvSB.

I think 3) would be the cleanest solution, but it would need a major rewrite of nearly all of irssi's code. Honestly, I don't want to spend hours rewriting half of irssi unless someone who can (or will ;) apply my patch in the future agrees with that solution.
1) would also be an option. That would only need some flags ("may be utf8") attached to every perl hash and signal handler inside the C code.

I dunno what solution is the best, suggestions anyone?
Sebastian Schmidt (yath)
Thursday, 13 August 2009, 17:48 GMT
I moved all of the UTF8 checking into new_pv(). As far as I can see, only incoming data (read: either from the user or from the server) and constants are passed to new_pv. For the former it's correct to check for UTF8, for the latter it doesn't matter.

I also added a check for recode == ON, I missed that in my previous versions.
Emanuele Giaquinta (ayin)
Monday, 30 November 2009, 10:07 GMT
The approach is wrong, this stuff should not depend on recode at all. Ideally it would work in the following way:

irssi internal encoding <-> utf-8 <-> perl

Where the internal encoding is supposed to be the locale encoding currently (term_charset value), although irssi does no validation except for utf-8 and there are ways to add strings invalid wrt current locale (eval). If you want to handle only the utf-8 case use the is_utf8 function to know if the locale encoding is utf-8. The patch is also missing the support for decoding incoming strings from perl, which should be in the same change.
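
As a rough illustration of the "internal encoding <-> utf-8" half of that chain, the sketch below converts in both directions, assuming purely for the example that the internal (locale) encoding is ISO-8859-1. The function names are hypothetical; a real implementation would more likely use a general converter such as iconv and would take the encoding from term_charset.

```c
#include <stddef.h>

/* Internal (assumed ISO-8859-1) -> UTF-8, e.g. before handing a string to
 * perl. Returns the number of bytes written (without the NUL), or
 * (size_t)-1 if dst is too small. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *in = (const unsigned char *)src;
    size_t o = 0;

    for (; *in != '\0'; in++) {
        if (*in < 0x80) {
            if (o + 1 >= dstsize)
                return (size_t)-1;
            dst[o++] = (char)*in;
        } else {
            /* U+0080..U+00FF encode as two bytes in UTF-8 */
            if (o + 2 >= dstsize)
                return (size_t)-1;
            dst[o++] = (char)(0xC0 | (*in >> 6));
            dst[o++] = (char)(0x80 | (*in & 0x3F));
        }
    }
    if (o >= dstsize)
        return (size_t)-1;
    dst[o] = '\0';
    return o;
}

/* UTF-8 -> internal (assumed ISO-8859-1), e.g. for strings coming back from
 * perl. Fails with (size_t)-1 on malformed input, on characters above
 * U+00FF, or if dst is too small. */
size_t utf8_to_latin1(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *in = (const unsigned char *)src;
    size_t o = 0;

    while (*in != '\0') {
        unsigned char c;

        if (*in < 0x80) {
            c = *in++;
        } else if ((*in == 0xC2 || *in == 0xC3) && (in[1] & 0xC0) == 0x80) {
            c = (unsigned char)(((*in & 0x03) << 6) | (in[1] & 0x3F));
            in += 2;
        } else {
            return (size_t)-1;  /* not representable in ISO-8859-1 */
        }
        if (o + 1 >= dstsize)
            return (size_t)-1;
        dst[o++] = (char)c;
    }
    if (o >= dstsize)
        return (size_t)-1;
    dst[o] = '\0';
    return o;
}
```

The second function covers the direction the patch is said to be missing: strings returned from perl would have to be decoded back to the internal encoding, and that step can fail for characters the locale encoding cannot represent.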
