Suggestion for hyphenation indications in HTML - <HYPH>

Issued in May 1996 by Peter Svanberg (psv@nada.kth.se) and Olle Järnefors (ojarnef@nada.kth.se). Zucker example revised 18 April 1997.

Summary

We suggest a new HTML element HYPH, in the general form

<HYPH BEF="before_linebreak_string" AFT="after_linebreak_string">no_linebreak_string</HYPH>

to indicate where and how a word can be hyphenated. In the normal case it is reduced to <HYPH></HYPH> at the point where hyphenation may take place.

Motivation for hyphenation indication

The need for supporting hyphenation has been accentuated by recent developments in the WWW.
tables
When tables were introduced in HTML, the need for hyphenation greatly increased, because the columns that text must fit into became narrower.
a wider spectrum of languages
Some languages have a special need for hyphenation, because very long (compound) words are used quite frequently. Examples are the Scandinavian languages and German.

Server-side or client-side solution

In principle, hyphenation can be performed by the client program on its own, or by the client program guided by information from the server about suitable points for hyphenation. Generally the latter approach seems easier to realize, because the information provider is best qualified to select the best points of hyphenation. The provider can use word processing programs specialized for the language used or the subject area covered.

A client program cannot be expected to be competent at hyphenation for the many different languages that are used in the internationalized WWW. If the information provider can use HTML markup to indicate hyphenation points, no new functionality in server software is needed at all, and the extra program support needed for hyphenation in clients is very small.

Hyphenation is certainly a more demanding task for HTML documents than for ordinary paper documents produced by word processing. Because word wrapping in HTML documents is dynamic, hyphenation hints must be included virtually everywhere in a paragraph, not only at the ends of a few lines.

Current situation

Currently, the only way in HTML to tell the client where a word may be hyphenated is to use the soft hyphen character. RFC 1866, however, discourages its use:
 NOTE - Use of the non-breaking space and soft hyphen indicator
        characters is discouraged because support for them is not
        widely deployed.
The more popular commercial client programs do not support the use of soft hyphen. Even worse, these implementations sabotage its use (which was defined by ISO 8859-1 in 1987) by showing a visible hyphen for every soft hyphen character. Had they elected to show no symbol at all, it would have been possible to include this special kind of markup in HTML documents without such bad side effects.

The insufficiency of soft hyphen for i18n use

A more fundamental problem with soft hyphen is that it cannot represent hyphenation behaviour in some special cases in languages such as Swedish and German. To give some simple examples, in Swedish the word "tillaga", if hyphenated, becomes "till-
laga", i.e. an extra letter suddenly appears. In some German cases, the situation is even more complicated. The word "Zucker", when hyphenated properly, is transformed to "Zuk-
ker".

Finally, we would like to note that the hyphenation of certain words in these languages is dependent on the meaning of the word in its context, which makes an adequate client-side solution almost impossible.

Specification of element HYPH

The soft hyphen problems show that there is a need for a more backwards compatible and more general solution for indicating possible hyphenation points than using the specific character soft hyphen. It is also an advantage if that solution is very simple to implement in browsers. We have a proposal that meets all three of these requirements:

A new element HYPH to specify hyphenation points should be introduced. Example:

internationa<HYPH></HYPH>lization

An old HTML browser will show the full word "internationalization". A browser implementing this proposal can use the indicated point to hyphenate this word, if that would enhance the appearance of the current paragraph.
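
The backward compatibility described here rests on the usual behaviour of HTML clients, which skip tags they do not recognize but keep the text around and inside them. The following Python sketch illustrates that effect. The names `TextOnly` and `legacy_render` are hypothetical, introduced only for this illustration, and Python's standard `html.parser` module merely stands in for an old browser.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and silently skips every tag it meets."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Tags such as <HYPH> never reach this method; only text does.
        self.chunks.append(data)


def legacy_render(markup: str) -> str:
    """Return what a client unaware of HYPH would display (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return "".join(parser.chunks)


print(legacy_render("internationa<HYPH></HYPH>lization"))
```

Because the content of the element is ordinary text, it survives in such a client, which is why the text placed inside HYPH has to be the unhyphenated form.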

To handle the special cases mentioned above, the following three attributes can be used:

BEF
Gives the string to insert before the line break if hyphenation is performed. Default value (if unspecified) is "-".
AFT
Gives the string to insert after the line break. Default value (if unspecified) is "".
SUBST
Identifies the hyphenation character in BEF, if other than the default "-". This makes it possible for the browser to substitute this hyphenation character by a special hyphenation character preferred by the user.
If there is text inside the HYPH element, it is displayed only if hyphenation is not done.

(This functionality is inspired by the discretionary function in the text formatting and typesetting system TeX.)
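
The attribute semantics can be modelled as a small data structure with two renderings, one for each decision the browser can make. This is a minimal sketch under stated assumptions, not a reference implementation: the class `Hyph`, its field names and the `preferred_hyphen` parameter are hypothetical, the line break is represented by a newline character, and the handling of SUBST (replacing the named character in BEF with the user's preferred hyphen) is one possible reading of the description above.

```python
from dataclasses import dataclass


@dataclass
class Hyph:
    """One HYPH element; field names mirror the proposed attributes."""
    bef: str = "-"      # BEF: shown before the line break (default "-")
    aft: str = ""       # AFT: shown after the line break (default "")
    subst: str = "-"    # SUBST: the hyphenation character inside BEF
    content: str = ""   # element content: shown only when no break occurs

    def render(self, broken: bool, preferred_hyphen: str = "-") -> str:
        """Return the text this element contributes to the output."""
        if not broken:
            return self.content
        # Assumption: the user's preferred hyphen replaces the character
        # named by SUBST wherever it occurs in BEF.
        if self.subst:
            bef = self.bef.replace(self.subst, preferred_hyphen)
        else:
            bef = self.bef
        return bef + "\n" + self.aft


plain = Hyph()  # corresponds to an empty <HYPH></HYPH>
print(repr("internationa" + plain.render(False) + "lization"))
print(repr("internationa" + plain.render(True) + "lization"))
```

With all defaults in place, the unbroken rendering contributes nothing and the broken rendering contributes a hyphen followed by the line break, which matches the reduced form `<HYPH></HYPH>`.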

Here is a more complex example to illustrate the use of the attributes:

The correct hyphenation behaviour of the German word "Zucker" can be specified in this way:

Zu<HYPH BEF="k-">c</HYPH>ker
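
How a client might turn such markup into the two possible renderings can be sketched in Python. Several things here are assumptions: the function names `unbroken` and `split_at_hyph` are hypothetical, the regular expressions are a deliberate simplification of real HTML parsing (they expect double-quoted attribute values without ">" in them), and the markup shown for the Swedish word "tillaga" is derived from the attribute definitions above and is not taken from the proposal itself.

```python
import re
from typing import Tuple

# Matches one HYPH element: group 1 is the attribute text, group 2 the content.
HYPH_RE = re.compile(r"<HYPH\b([^>]*)>(.*?)</HYPH>", re.IGNORECASE | re.DOTALL)
# Matches NAME="value" pairs inside the start tag.
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def unbroken(markup: str) -> str:
    """Rendering when no hyphenation is done: keep only the element content."""
    return HYPH_RE.sub(lambda m: m.group(2), markup)


def split_at_hyph(markup: str) -> Tuple[str, str]:
    """Break at the first HYPH element; return (end of line, start of next line)."""
    m = HYPH_RE.search(markup)
    if m is None:
        raise ValueError("no HYPH element found")
    attrs = {name.upper(): value for name, value in ATTR_RE.findall(m.group(1))}
    bef = attrs.get("BEF", "-")  # default given in the attribute list
    aft = attrs.get("AFT", "")   # default given in the attribute list
    # Any further HYPH elements on either side are rendered unbroken.
    return unbroken(markup[:m.start()]) + bef, aft + unbroken(markup[m.end():])


for word in ('Zu<HYPH BEF="k-">c</HYPH>ker', 'till<HYPH AFT="l"></HYPH>aga'):
    print(unbroken(word), split_at_hyph(word))
```

In the "Zucker" case the element content "c" disappears when the break is taken and BEF supplies "k-" instead. In the assumed "tillaga" markup the default BEF supplies the hyphen and AFT supplies the extra "l" at the start of the next line.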

Note that in future usage, based on ISO/IEC 10646 as the character set, there are two characters to choose between in this context: the ASCII character HYPHEN-MINUS (hex 002D) and HYPHEN (hex 2010, decimal 8208). The latter should normally be used for hyphenation.
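
The code points of the two characters can be checked against the Unicode character database, which shares its character repertoire with ISO/IEC 10646. The short Python sketch below does so using the standard `unicodedata` module. The helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return code point (hex and decimal) and character name for one character."""
    return f"U+{ord(ch):04X} (decimal {ord(ch)}) {unicodedata.name(ch)}"


for ch in ("\u002D", "\u2010"):
    print(describe(ch))
```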

Test file

This suggestion is tested in this test text.
Latest update April 18, 1997 <webmaster@nada.kth.se>