Latest Changes: Updated for the new patch iconv-7d-2.diff, which fixes the bounce command when send-charset is set, fixes the chopped subject, handles all header fields with assumed-charset, and adds support for overlong RFC 2047 encoded words.

PINE UTF-8 FAQ

Q: Where can I get the latest patch / RPMs?

http://www.suse.de/~bk/pine/iconv/

New: Pine 4.56 is out!

The newest patch is always available in the Pine 4.56 directory.

The patch for patchlevel 7d-2 in this directory fixes the bounce command when send-charset is set, fixes the chopped subject, handles all header fields with assumed-charset, and adds support for overlong RFC 2047 encoded words.

Older patches for Pine 4.55
The well-tested patch for 4.55 is in the stable directory. A newer patch, with send-charset implemented by Jungshik Shin, is in the experimental directory. Newer still is the patch in the 4.56 directory.

If one of these directories has an RPM subdirectory, it contains RPM packages for SuSE 8.1 or newer (8.2) or an equivalent system (glibc 2.2 or higher).

Any contribution (suggestions, tips, patches, test cases) is welcome.

Q: Is the charset conversion applied during printing?

Yes. Right now the text is converted to the character-set configured in the Pine config.

There may also be an issue with line breaking during printing. It could be circumvented by disabling line breaking for printing and letting the printer or print system do the line breaking.

Print should possibly be enhanced to convert to a separate character set for which the printer or print system is configured.

Report from a tester:

I have had time to test printing, and it works quite well! Character set conversion to UTF-8 happens. This is good in my case -- I am doing "attached print", and my emulator handles the UTF-8 codes in the print stream.

However, I can imagine that it would be a problem for other people who were writing directly to a printer. There are not many (any?) UTF-8 printers around. If the user had a printer that worked in Big5 (Chinese), and they usually got Big5 emails and printed them, then installing your patch would break things. I think the best solution is to create an additional configuration variable for "printer character set"; or maybe a flag associated with each printer definition.

(I do one other change for printing. In cmd_print, where it calls format_message, I add a flag FM_NOWRAP, so it doesn't break my lines based on the number of screen columns.)

If you are not printing to a printer attached to the terminal, it should be possible to use a character set filter as the personal print command in Setup/Printers. For instance, you can use Juliusz Chroboczek's cedilla, which has good WGL4 coverage, sufficient for Latin, Greek and Cyrillic text. Another good printer filter is uniprint, included in Gaspar Sinai's Yudit.

 Personally selected print command
      The text to be printed will be piped into the command given here. The
      command is in the 2nd column, the printer name is in the first column. Some
      examples are: "prt", "lpr", "lp", or "enscript". The command may be given
      with options, for example "enscript -2 -r" or "lpr -Plpacc170". The
      commands and options on your system may be different from these examples.
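
As an illustration of the character set filter idea, here is a minimal sketch in Python. It is not part of the patch. The script name to_printer.py, the function name and the default charsets are made up for this example, and it assumes that the text Pine pipes into the print command is in the display character-set (UTF-8 here) while the printer expects a legacy charset.

```python
import sys


def convert_for_printer(data: bytes,
                        display_charset: str = "utf-8",
                        printer_charset: str = "iso-8859-1") -> bytes:
    """Re-encode print data from the display charset to the printer charset.

    Both defaults are assumptions for this sketch. Characters the printer
    charset cannot represent are replaced with "?" instead of aborting.
    """
    text = data.decode(display_charset, errors="replace")
    return text.encode(printer_charset, errors="replace")


# Filter mode: the first argument names the printer charset, for example
#   python3 to_printer.py big5
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.stdout.buffer.write(
        convert_for_printer(sys.stdin.buffer.read(), "utf-8", sys.argv[1]))
```

If the print command is run through a shell on your system, something like "python3 to_printer.py iso-8859-1 | lpr" could then be entered as the personally selected print command. Whether that works depends on your setup, so test it with a short message first.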

Q: Are message headers such as Subject converted?

Yes, they are, thanks to Jungshik's work. Message headers that are properly encoded according to RFC 2047 present little problem. However, a lot of web mail services and mail clients send out untagged raw 8-bit characters in the message headers. Those characters are assumed to be in assumed-charset and are converted to character-set.
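
The two cases can be sketched in Python as follows. This only illustrates the idea and is not the code from the patch (which is C and uses iconv). The function name decode_header_value and the default of iso-8859-1 for the assumed charset are made up for this example.

```python
from email.header import decode_header


def decode_header_value(raw: bytes, assumed_charset: str = "iso-8859-1") -> str:
    """Return a header value as text, however the sender encoded it."""
    try:
        ascii_text = raw.decode("ascii")
    except UnicodeDecodeError:
        # Untagged raw 8-bit bytes: nothing in the header says which charset
        # they are in, so fall back to the configured assumption.
        return raw.decode(assumed_charset, errors="replace")

    # Pure ASCII: may contain RFC 2047 encoded words like =?charset?q?...?=
    pieces = []
    for part, charset in decode_header(ascii_text):
        if isinstance(part, bytes):
            try:
                part = part.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Charset label unknown to Python: use the assumption instead.
                part = part.decode(assumed_charset, errors="replace")
        pieces.append(part)
    return "".join(pieces)


# An RFC 2047 encoded subject, then conversion to the display charset
# (character-set = UTF-8 in this example).
subject = decode_header_value(b"=?iso-8859-1?q?Gr=FC=DFe?=")
display_bytes = subject.encode("utf-8", errors="replace")
```

A header with the same text sent as raw Latin-1 bytes (b"Gr\xfc\xdfe") takes the first branch and gives the same result, as long as the assumed charset matches what the sender really used.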

Q: Is any charset conversion applied when reading and writing local files?

When you use the Export command, the same conversion is applied as for the message display. Export saves only the message headers currently displayed. That is, headers that are not in view are not exported.

Saving (a) message(s) to a folder with Save is an entirely different thing. This operation preserves all information present in the message.

For reading local text or binary files, e.g. for attaching them to an outgoing message, no translation is applied.

The same holds for saving attachments: they should not be converted and should be preserved over the whole message pipeline for read and write.

Nothing in this area should be translated by default, because a user should store files in the same charset as her file system charset.

If necessary, charset conversion might be implemented for reading and writing local files, but since files on the system can be converted by other programs as well, there is normally no need to do it in the mail reader.

It is easy to write a shell or perl script to convert all files under a given directory to UTF-8. If you need it, I can check where it is.
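
Until then, here is a minimal sketch of such a converter in Python instead of shell or Perl. The function name is made up for this example, and you have to supply the source charset yourself because it cannot be detected reliably. Run it on a copy of your data first: with a single-byte charset such as iso-8859-1 every byte sequence decodes successfully, so binary files under the directory would be rewritten as well.

```python
from pathlib import Path


def convert_tree_to_utf8(root, source_charset):
    """Re-encode files under root from source_charset to UTF-8, in place.

    Returns the list of paths that were rewritten.
    """
    converted = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        try:
            data.decode("utf-8")
            continue  # already valid UTF-8 (this includes pure ASCII)
        except UnicodeDecodeError:
            pass
        try:
            text = data.decode(source_charset)
        except UnicodeDecodeError:
            continue  # not valid in the source charset either: skip it
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted
```

For example, convert_tree_to_utf8("mail-archive-copy", "iso-8859-1") would convert a copied directory of Latin-1 text files; the directory name here is only a placeholder.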

Q: Are keystrokes coming into pine assumed to be in UTF-8 also?

Yes. If you use Pine in a UTF-8 terminal and set the charset to UTF-8, it is assumed that the keystrokes also arrive in UTF-8, as they do, for example, from the xterm of XFree86.

Unfortunately, Pine's input and cursor movement do not yet support multibyte characters. When you type a UTF-8 character in Pico (Pine's default built-in editor) or in the message header lines, the cursor moves by two, three or four columns, because Pine assumes that a single character takes a single byte to represent and that a character represented in a single byte takes a single column to display. Neither assumption holds for UTF-8 or for multibyte charsets in general.

If the bytes of a multibyte character are not received in quick succession by the terminal, it may not display them correctly. Fortunately, you can refresh the screen by pressing Ctrl-L, as you do in vi.

The cursor positioning is a real problem, however. As soon as you enter the first multibyte character into a line, Pine will assume that the cursor has to move 2, 3 or 4 column widths. Likewise, if you step back with the left arrow or backspace, you have to do it for every byte of the multibyte character. So this is a real pain.
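
To make the difference between bytes, characters and columns concrete, here is a small Python sketch. It has nothing to do with Pine's actual code, and the helper names are made up. The width rule is a simplification: combining marks count as zero columns, East Asian wide and fullwidth characters as two, and everything else as one.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate number of terminal columns needed to show text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the previous character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width


def last_char_length(data: bytes) -> int:
    """Number of bytes a single backspace should remove from UTF-8 data."""
    if not data:
        return 0
    n = 1
    # UTF-8 continuation bytes look like 10xxxxxx; walk back to the lead byte.
    while n < len(data) and n < 4 and (data[-n] & 0xC0) == 0x80:
        n += 1
    return n


for sample in ("a", "\u00fc", "\u20ac", "\u65e5"):
    encoded = sample.encode("utf-8")
    print(sample, len(encoded), "byte(s),", display_width(sample), "column(s)")
```

An editor that moves the cursor by the byte count rather than the column count ends up in the wrong place after the first non-ASCII character, which is the behaviour described above.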

So if you use multibyte characters, you are advised to configure alternate-editor for editing outgoing messages within the Pine session instead of using the default editor Pico. Unfortunately, Pine does not support passing the subject and recipients into the external text editor and taking them back when the external editor exits. Therefore, for message headers (for instance, Subject), you either have to use cut and paste or enter the US-ASCII characters first and then insert the multibyte characters from right to left.

Apart from this, UTF-8 characters go out transparently from the editor to the mail and are labeled as charset UTF-8 when you set the character-set setting in Pine to "utf-8".

Jungshik's new patch makes it possible to convert outgoing mail on reply, send and forward using a new config option, send-charset, and he is planning to make this configurable at the time of message composition.

Q: Where is information on how to compile and configure it?

Bernhard Kaindl sent some info in this mail to pine-info.

Q: How to write mails which include UTF-8 characters?

Unfortunately pico and the internal editor of pine are not internationalized.

There are several editors for Unix that support UTF-8 rather well. For instance, Vim 6.x is an excellent text editor with solid UTF-8 support. Emacs also supports UTF-8 and has interfaces to CJK input methods. Mike Fabian has put up a page about UTF-8 Internationalisation in GNU Emacs and XEmacs in his document on CJK (Chinese, Japanese, Korean) Support in SuSE Linux.

Q: Who developed the PINE UTF-8 Patch?

Jungshik Shin developed the initial patch with the header conversion, charset and iconv aliases, and generic locale fixes. For the message body conversion, he used display-filters.

Bernhard Kaindl updated the patch from 4.44 to 4.53, 4.55 and 4.56, added the message body conversion by writing an internal filter with iconv, cleaned up the code, fixed a bug, and later debugged some other problems.

On top of the last patch for 4.55, Jungshik Shin made additional fixes for bugs in the message body translation and implemented the send-charset config option included in the iconv patch 7a and newer.

The send-charset config option is partly based on a patch from Eduardo Chappa, who made it possible to tag outgoing messages with the value of alt-character-set instead of X-UNKNOWN when the charset is not recognized by Pine.

Conceptually, send-charset is rather different from the alt-character-set patch, because send-charset is not just for tagging outgoing messages but also for converting the message headers and the message body from character-set (the terminal / display charset) to send-charset.

Eduardo's patch is described in detail at his great web site, which contains up-to-date information about Pine's latest features and is the definitive source of Pine patches!

Many thanks to him for providing this excellent web site and patches!

Q: What are the next goals?

Suggestions for further improvement are welcome. One of the next projects is to implement a control character filter that understands UTF-8. It should filter out real control characters and bytes that do not conform to the UTF-8 encoding, without touching bytes that look like control characters but are part of a valid UTF-8 sequence.
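
The problem is that a byte-wise filter treats the bytes 0x80 to 0x9F as C1 control characters, although in UTF-8 they are ordinary continuation bytes (the Euro sign, for instance, is E2 82 AC). One possible approach, sketched here in Python and not taken from the patch, is to decode first and filter whole characters afterwards. The function name, the set of characters to keep, and the choice to turn invalid bytes into U+FFFD instead of dropping them are all assumptions of this sketch.

```python
import unicodedata


def filter_control_chars(data: bytes, keep: str = "\n\r\t") -> bytes:
    """Remove control characters from UTF-8 data without breaking sequences.

    Bytes that are not valid UTF-8 become U+FFFD (the replacement character),
    so a stray 0x9B can no longer act as a terminal control sequence.
    """
    text = data.decode("utf-8", errors="replace")
    cleaned = "".join(
        ch for ch in text
        # Category "Cc" covers both the C0 (U+0000-U+001F, U+007F) and the
        # C1 (U+0080-U+009F) control characters.
        if ch in keep or unicodedata.category(ch) != "Cc"
    )
    return cleaned.encode("utf-8")
```

With this, filter_control_chars(b"a\x1bb") drops the ESC byte, while the UTF-8 encoding of the Euro sign passes through unchanged even though it contains the byte 0x82.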

Q: What about including it in other pine patches and the mainstream Pine?

There are still some rough edges and some features missing on which help is wanted:

Things which are wanted are:

The definition of UTF-8 (UCS/Unicode Transformation Format 8) can be found in Unicode and ISO 10646. A draft RFC on UTF-8 submitted to the IETF is also a good reference. One of numerous implementations can be found in Mozilla.

A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX is an excellent resource (with many screenshots) if you are looking for UTF-8 terminal emulators, editors, conversion and printing utilities, and fonts!


Last Updated 2003-06-08