GF character encoding changes
Thomas Hallgren
2013-12-18

== Changes to character encodings in GF grammar files ==

Between the release of GF 3.5 and the next version, two changes were made
relating to character encodings in GF grammar files:

+ The default character encoding was changed from Latin-1
  (also known as ISO-8859-1, and closely related to cp1252) to UTF-8.

+ The way you specify alternate character encodings was changed. Instead of
  using a ``flags coding = ...`` declaration in the source file, you should now
  use a pragma ``--# -coding=...`` at the top of the file.


== Advantages ==

UTF-8 is the default encoding for text files on many systems these days, so
it makes sense to use it as the default for GF grammar files too.

Changing how alternate encodings are specified allows conversion to Unicode
to be done before parsing, which means that

- we can allow Unicode characters in identifiers, not just in string literals,
- error messages can report accurate column positions,
- and (an implementation detail) we can use Alex to generate the lexer again.
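

The column-position point can be illustrated with a small Python sketch
(not GF code). It assumes, for illustration only, a lexer that counts
positions in undecoded UTF-8 bytes. The sample line and the helper
``column_of`` are made up for this example.

```python
def column_of(line, needle):
    """Return the 1-based column of the first occurrence of needle.

    Works on both str (columns count characters) and bytes
    (columns count bytes).
    """
    return line.index(needle) + 1


# A sample line with one non-ASCII character before the position of interest.
line = 'lin Beer = "öl" ; ?'
raw = line.encode("utf-8")

char_col = column_of(line, "?")   # counted in decoded characters
byte_col = column_of(raw, b"?")   # counted in raw bytes

# ö takes two bytes in UTF-8, so the byte-based column is one too high.
print(char_col, byte_col)
```

Every non-ASCII character before the reported position widens the gap, so
counting after decoding is what keeps the reported column in step with what
an editor shows.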


== How are my grammar files affected? ==

If your files still compile without errors after the change, you don't need
to do anything. (But see Known problems below!)
If you get one of the following messages,

- ``lexical error``,
- ``encoding mismatch``,
- ``Warning: default encoding has changed from Latin-1 to UTF-8``


you need to add a
``--# -coding=...`` pragma to your file (or convert it to UTF-8).

- For files containing only ASCII characters, no change is needed.
- For files encoded in UTF-8 (and thus using a ``flags coding=utf8``
  declaration), no change is needed.
- For files containing Latin-1 characters (e.g. characters like
  å ä ö ü é), add a ``--# -coding=latin1`` pragma at the top of the file.
- For files using other encodings, copy the encoding specified in the
  ``flags coding=``//enc// to a corresponding ``--# -coding=``//enc//.


Grammars changed in this way will still compile with GF 3.5.
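
To find out which of the cases above applies to a file, a check along the
following lines may help. This is a minimal Python sketch, not part of GF.
The function name ``classify_encoding`` is made up for this example, and
the result "other" only means that the bytes are not valid UTF-8. It does
not tell you which encoding the file actually uses.

```python
import sys


def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other' for the raw bytes of a file."""
    try:
        data.decode("ascii")
        return "ascii"      # no change needed
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict decoding: invalid sequences raise
        return "utf-8"      # no change needed
    except UnicodeDecodeError:
        return "other"      # e.g. Latin-1: add a --# -coding=... pragma


if __name__ == "__main__":
    # Classify every file named on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, classify_encoding(f.read()))
```

A short Latin-1 file can occasionally happen to be valid UTF-8, so treat the
result as a hint and check the non-ASCII characters by eye.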


Note that GF only understands one option per pragma line. If you already
have a ``--# -path=...`` pragma, you cannot put the ``-coding=...`` option on
the same line. Add it on a separate line:

```
	--# -path=...
	--# -coding=...
```

The recommendation for the future is to use UTF-8 for all source files.
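
If you prefer to convert a Latin-1 file rather than add a pragma, the
conversion can be sketched in Python as follows. The helper names are
hypothetical, and the sketch assumes the input really is Latin-1: running it
on a file that is already in UTF-8 would corrupt the non-ASCII characters.
Any existing coding declaration in the file is left untouched and would need
to be adjusted by hand.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def convert_file(path: str) -> None:
    """Rewrite the file at path in place, from Latin-1 to UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(latin1_to_utf8(data))
```

Since the file is overwritten in place, keep a backup or work on a file
under version control.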


== Known problems ==

The intention is that if a grammar file is affected by the changed default
encoding, then you will see one of the messages listed in the previous
section when you compile the grammar. But there are a couple of issues to be
aware of:

- Alex 3.0 seems to be confused about the length of matched strings sometimes.
  This can cause it to skip more than one line when it encounters a one-line
  comment in a grammar file with character encoding problems. So instead of a
  lexical error in the comment, you can get an odd syntax error
  on a subsequent line.

- If you explicitly specify ``-coding=utf8`` for a file that is not in UTF-8, you
  will not get an error, because the UTF-8 decoding function we currently use is
  forgiving, substituting the Unicode replacement character (U+FFFD) instead of
  reporting an error. Hopefully, we will be able to change this.
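

The effect of a forgiving decoder can be reproduced outside GF. The
following Python sketch is only an analogy, since GF's decoder is not
written in Python. It decodes Latin-1 bytes as UTF-8, once leniently and
once strictly. The function names are made up for this example.

```python
def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, replacing invalid bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data: bytes) -> str:
    """Decode as UTF-8, raising UnicodeDecodeError on invalid bytes."""
    return data.decode("utf-8", errors="strict")


# A line written in Latin-1 but read as if it were UTF-8.
latin1_bytes = 'lin Beer = "öl" ;'.encode("latin-1")

# Lenient decoding raises no error, but the ö is silently lost.
lenient = decode_lenient(latin1_bytes)

# Strict decoding detects the mismatch.
try:
    decode_strict(latin1_bytes)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With lenient decoding the only trace of the problem is the replacement
character in the resulting string, which is why no error is reported in
this situation.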