GF character encoding changes
Thomas Hallgren
2013-12-18

== Changes to character encodings in GF grammar files ==

Between the release of GF 3.5 and the next version, two changes were made
relating to character encodings in GF grammar files:

+ The default character encoding was changed from Latin-1
  (also known as ISO-8859-1, and closely related to cp1252) to UTF-8.

+ The way you specify alternate character encodings was changed. Instead of
  using a ``flags coding = ...`` declaration in the source file, you should now
  use a pragma ``--# -coding=...`` at the top of the file.


== Advantages ==

UTF-8 is the default encoding for text files on many systems these days, so
it makes sense to use it as the default for GF grammar files too.

Changing how alternate encodings are specified allows conversion to Unicode
to be done before parsing, which means that

- we can allow Unicode characters in identifiers, not just in string literals,
- we can report accurate column positions in error messages,
- and (an implementation detail) we can use Alex to generate the lexer again.


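To illustrate why a pragma makes decoding before parsing possible, here is a minimal Python sketch. It is not GF's actual implementation (GF is written in Haskell). The names ``find_coding_pragma`` and ``decode_source`` are hypothetical, and the sketch assumes that pragmas form a leading run of ``--#`` lines and that the encoding name is one Python also recognises (such as ``utf8`` or ``latin1``). Because the pragma consists of ASCII characters, it can be found in the raw bytes before any decoding takes place.

```python
import re

# Hypothetical helper names; this is not GF's actual implementation.
PRAGMA_RE = re.compile(rb"^--#\s*-coding=([A-Za-z0-9_-]+)\s*$")


def find_coding_pragma(data: bytes, default: str = "utf8") -> str:
    """Return the encoding named by a leading ``--# -coding=...`` pragma.

    Only the leading run of ``--#`` lines is inspected, and the lines are
    matched as raw bytes, so nothing has to be decoded to find the encoding.
    """
    for line in data.splitlines():
        stripped = line.strip()
        if not stripped.startswith(b"--#"):
            break  # assumption: pragmas sit at the very top of the file
        match = PRAGMA_RE.match(stripped)
        if match:
            return match.group(1).decode("ascii")
    return default


def decode_source(data: bytes) -> str:
    """Decode a whole grammar file to Unicode before lexing or parsing."""
    return data.decode(find_coding_pragma(data))
```

A lexer that receives the result of ``decode_source`` only ever sees Unicode text, which is what lets identifiers contain non-ASCII characters and lets column positions count characters rather than bytes.
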
== How are my grammar files affected? ==

If your files still compile without errors after the change, you don't need
to do anything. (But see Known problems below!)
If you get one of the following messages,

- ``lexical error``,
- ``encoding mismatch``,
- ``Warning: default encoding has changed from Latin-1 to UTF-8``


you need to add a ``--# -coding=...`` pragma to your file (or convert it to UTF-8).

- For files containing only ASCII characters, no change is needed.
- For files encoded in UTF-8 (and thus using a ``flags coding=utf8``
  declaration), no change is needed.
- For files containing Latin-1 characters (e.g. characters like
  å ä ö ü é), add a ``--# -coding=latin1`` pragma at the top of the file.
- For files using other encodings, copy the encoding specified in the
  ``flags coding=``//enc// declaration to a corresponding
  ``--# -coding=``//enc// pragma.


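The cases above can be told apart mechanically, at least roughly. The following Python sketch is an illustration only and not part of GF; the function name ``classify_grammar_bytes`` is hypothetical. It checks whether a file's bytes are pure ASCII, valid UTF-8, or something else. It cannot tell Latin-1 apart from other 8-bit encodings, and in rare cases a non-UTF-8 file can happen to be valid UTF-8, so treat the result as a hint.

```python
def classify_grammar_bytes(data: bytes) -> str:
    """Classify raw file contents as 'ascii', 'utf8' or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"  # no change needed
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict decoding: invalid bytes raise
        return "utf8"  # no change needed
    except UnicodeDecodeError:
        # Needs a --# -coding=... pragma, or conversion to UTF-8.
        return "other"
```

For example, ``classify_grammar_bytes(open("Foods.gf", "rb").read())`` would classify a file named ``Foods.gf`` (a made-up file name).
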
Grammars will still compile with GF 3.5 after these changes.

Note that GF only understands one option per pragma line. If you already
have a ``--# -path=...`` pragma, you cannot put the ``-coding=...`` option on
the same line. Add it on a separate line:

```
--# -path=...
--# -coding=...
```

The recommendation for the future is to use UTF-8 for all source files.


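One way to follow this recommendation for an existing Latin-1 file is to re-encode it. The Python sketch below is not a GF tool; the names ``latin1_to_utf8`` and ``convert_file`` are hypothetical, and it assumes the input really is Latin-1. Running it on a file that is already UTF-8 would double-encode the non-ASCII characters, so check the encoding first.

```python
from pathlib import Path


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 (assumes the input is Latin-1)."""
    return data.decode("latin-1").encode("utf-8")


def convert_file(path: str) -> None:
    """Rewrite a file in place as UTF-8. Keep a backup before using this."""
    p = Path(path)
    p.write_bytes(latin1_to_utf8(p.read_bytes()))
```

After converting, a ``--# -coding=latin1`` pragma left in the file would presumably no longer match its contents, so remove it or change it to ``utf8``.
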
== Known problems ==

The intention is that if a grammar file is affected by the changed default
encoding, then you will see one of the messages listed in the previous
section when you compile the grammar. But there are a couple of issues to be
aware of:

- Alex 3.0 seems to be confused about the length of matched strings sometimes.
|
||
This can cause it to skip more than one line when it encounters a one-line
|
||
comment in a grammar file with character encoding problems. So instead of a
|
||
lexical error in the comment, you can get an odd syntax error
|
||
on a subsequent line.
|
||
|
||
- If you explicitly specify -coding=utf8 for a file that is not in UTF-8, you
|
||
will not get an error, because the UTF-8 decoding function we currently use is
|
||
forgiving, substituting the Unicode replacement character <20>, instead of
|
||
reporting an error. Hopefully, we will be able to change this.
|
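The difference between forgiving and strict UTF-8 decoding can be shown with a short Python sketch. Python's decoder stands in here for the Haskell function GF uses, so this only shows the general behaviour, not GF's code; the helper name ``decodes_strictly`` is hypothetical.

```python
latin1_bytes = "öl".encode("latin-1")  # b'\xf6l', which is not valid UTF-8

# Forgiving decoding: the invalid byte silently becomes U+FFFD.
forgiving = latin1_bytes.decode("utf-8", errors="replace")


def decodes_strictly(data: bytes) -> bool:
    """Return True only if the data is valid UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

With forgiving decoding the mismatch goes unnoticed until the replacement character shows up in the output. A strict check like ``decodes_strictly`` can be run on a file before adding a ``-coding=utf8`` pragma.
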