forked from GitHub/gf-core

Document the upcoming default character encoding change in the release notes

hallgren
2013-11-25 19:47:05 +00:00
parent 4a82481a1f
commit 9541668f76
2 changed files with 95 additions and 0 deletions


@@ -0,0 +1,88 @@
GF character encoding changes
Thomas Hallgren
%%mtime(%F)
%!style:../css/style.css
%!postproc(html): <TITLE> <meta charset="UTF-8"><meta name = "viewport" content = "width = device-width"> <TITLE>
%!postproc(html): <H1> <H1><a href="../"><IMG src="../doc/Logos/gf0.png"></a>
==Planned changes to character encodings in GF grammar files ==
We plan to make two changes:
+ Currently the default character encoding in GF grammar files is Latin-1
(also known as ISO-8859-1; Windows code page 1252 is a close superset of it).
We plan to change the default to UTF-8.
+ It is currently possible to use another character encoding by specifying it
with a ``flags coding = ...`` declaration in the source file. We plan to change
this to use a pragma ``--# -coding=...`` at the top of the file instead.
== Advantages ==
UTF-8 is the default encoding for text files on many systems these days, so
it makes sense to use it as the default for GF grammar files too.
Changing how alternate encodings are specified allows conversion to Unicode
to be done before parsing, which means that
- we can allow Unicode characters in identifiers, not just in string literals,
- it makes accurate column positions in error messages possible,
- and (an implementation detail) we can use Alex to generate the lexer again.
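To illustrate the column-position point: if positions are counted on undecoded bytes, every multi-byte character earlier on the line shifts the reported column. The following is a minimal Python sketch of that effect, not GF's actual lexer; the helper name `columns` is made up for this illustration.

```python
from typing import Tuple


def columns(line: str, token: str) -> Tuple[int, int]:
    """Return the 1-based (character column, UTF-8 byte column) of token in line."""
    char_col = line.index(token) + 1
    byte_col = line.encode("utf-8").index(token.encode("utf-8")) + 1
    return char_col, byte_col


# The line is "åäö = x"; each of the three letters takes two bytes in UTF-8,
# so the byte-based column of "x" is three larger than the character column.
print(columns("\u00e5\u00e4\u00f6 = x", "x"))
```

Once the text has been decoded to Unicode before lexing, the character column is the one the lexer sees.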
== How are my grammar files affected? ==
If your files still compile without errors after the change, you don't need
to do anything. (But see Known problems below!)
If you get one of the following messages,
- ``lexical error``,
- ``encoding mismatch``,
- ``Warning: default encoding has changed from Latin-1 to UTF-8``
you need to add a
``--# -coding=...`` pragma to your file (or convert it to UTF-8).
- For files containing only ASCII characters, no change is needed.
- For files encoded in UTF-8 (and thus using a ``flags coding=utf8``
declaration), no change is needed.
- For files containing Latin-1 characters (e.g. characters like
å ä ö ü é), add a ``--# -coding=latin1`` pragma at the top of the file.
- For files using other encodings, copy the encoding specified in the
``flags coding=``//enc// to a corresponding ``--# -coding=``//enc//.
Grammars will still compile with GF-3.5 after these changes.
Note that GF only understands one option per pragma line. If you already
have a ``--# -path=...`` pragma, you cannot put the ``-coding=...`` option on
the same line. Add it on a separate line:
```
--# -path=...
--# -coding=...
```
The recommendation for the future is to use UTF-8 for all source files.
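The case analysis in this section can be sketched as a small pre-check that you run on a grammar file before compiling it. This is a hypothetical helper in Python, not part of GF. The function names are invented, and the check cannot tell which non-UTF-8 encoding a file uses, so choosing the right ``-coding`` value is still up to you.

```python
def suggest_coding_action(data: bytes) -> str:
    """Suggest what to do with a grammar file, given its raw bytes."""
    try:
        data.decode("ascii")
        return "ascii: no change needed"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # Python's decoder is strict by default
        return "utf-8: no change needed"
    except UnicodeDecodeError:
        return "not utf-8: add a '--# -coding=...' pragma or convert to UTF-8"


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes that are assumed (not checked) to be Latin-1 as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        with open(name, "rb") as f:
            print(name, "->", suggest_coding_action(f.read()))
```

A file that passes the strict UTF-8 check is almost certainly UTF-8, but a Latin-1 file could by coincidence form valid UTF-8 byte sequences, so treat the result as a hint.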
== Known problems ==
The intention is that if a grammar file is affected by the changed default
encoding, then you will see one of the messages listed in the previous
section when you compile the grammar. But there are a couple of issues to be
aware of:
- Alex 3.0 seems to be confused about the length of matched strings sometimes.
This can cause it to skip more than one line when it encounters a one-line
comment in a grammar file with character encoding problems. So instead of a
lexical error in the comment, you can get an odd syntax error
on a subsequent line.
- If you explicitly specify -coding=utf8 for a file that is not in UTF-8, you
will not get an error, because the UTF-8 decoding function we currently use is
forgiving, substituting the Unicode replacement character <20>, instead of
reporting an error. Hopefully, we will be able to change this.
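The difference between forgiving and strict UTF-8 decoding can be seen in the following Python sketch. It uses Python's built-in decoder, not the Haskell function GF uses, and the helper name `is_valid_utf8` is invented for this example. Until the compiler reports such errors itself, a strict check like this is one way to catch a wrong ``-coding=utf8`` pragma.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without any errors."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# A line of grammar source encoded as Latin-1, where "å" is the single byte 0xE5.
latin1_bytes = 'lin Hello = "hall\u00e5" ;'.encode("latin-1")

# Forgiving decoding silently replaces the invalid byte with U+FFFD.
forgiving = latin1_bytes.decode("utf-8", errors="replace")
print("\ufffd" in forgiving)        # the replacement character is present

# Strict decoding reports the problem instead.
print(is_valid_utf8(latin1_bytes))
```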


@@ -14,9 +14,15 @@ See the [download page http://www.grammaticalframework.org/download/index.html].
Over [...] changes have been pushed to the source repository since the release
of GF 3.5.
Closed issues: 30, 41, 57, 60, 61, 68.
===GF compiler and run-time library===
- The default character encoding in grammar files has been changed from
Latin-1 to UTF-8. Also, alternate character encodings should now be specified
as ``--# -coding=``//enc//, instead of ``flags coding=``//enc//.
See the separate document
[GF character encoding changes encoding-change.html] for more details.
- Nonlinear patterns (i.e., patterns where the same variable appears more than
once) in concrete syntax are now detected and reported as errors.
(Section C.4.13 in the GF book explicitly states that patterns must be
@@ -37,6 +43,7 @@ of GF 3.5.
(see the [updated synopsis ../lib/doc/synopsis.html]).
- [...]
===GF Cloud services===
- [...]