
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<TITLE>Checking GF Translation Dictionaries</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>Checking GF Translation Dictionaries</H1>
<FONT SIZE="4"><I>Aarne Ranta</I></FONT><BR>
<FONT SIZE="4">April 2014</FONT>
</CENTER>

<P>
<B>News</B>
</P>
<P>
9/5 Link to the current status: <A HREF="https://docs.google.com/spreadsheets/d/1NuLRp86UPjd298LxjhCAGlHsoPypxKpcBJfDab0De90/edit#gid=0">https://docs.google.com/spreadsheets/d/1NuLRp86UPjd298LxjhCAGlHsoPypxKpcBJfDab0De90/edit#gid=0</A>
</P>
<P>
9/5/2014 Removed many bogus subcats reported by dictionary authors and revealed by FrameNet. Please upgrade your TopDictionary from darcs or github!
</P>

<H2>Call for contributions: the generic translation dictionaries of GF</H2>

<P>
<B>Wanted</B>: manual checking of TopDictionary???.gf files in
<A HREF="http://www.grammaticalframework.org/lib/src/translator/todo">this directory</A>.
</P>
<P>
<B>Abstract syntax</B>: <A HREF="./TopDictionary.gf">TopDictionary</A>, the top-7000 English words from the British National Corpus, sorted by frequency as listed
<A HREF="http://www.kilgarriff.co.uk/BNClists/lemma.num">here</A>.
</P>
<P>
<B>Usage</B>: part of the general translation dictionaries, used for instance in the
<A HREF="http://www.grammaticalframework.org/demos/translation.html">GF translation demo</A>. The full dictionaries are the Dictionary* modules
in the <A HREF="../">parent directory</A>.
</P>
<P>
<B>Who</B>: anyone with good knowledge of the target language and with reasonable knowledge of the GF resource grammar paradigms for it.
</P>

<H2>How to do it</H2>

<P>
Follow these steps for your language. The examples below use ToCheckFre.gf; substitute the code of any other language in this directory for Fre.
</P>

<OL>
<LI>Make sure to download the latest version of the file.
<LI>Make sure you can compile the original file:

<PRE>
    gf ToCheckFre.gf +RTS -K64M
</PRE>

<LI>Edit the <CODE>lin</CODE> rules line by line, starting from the beginning. Follow the guidelines in the next section.
<LI>Mark the last rule you edit with "---- END edits by AR", with AR replaced by your initials.
<LI>Put, as the first line of the file, a comment indicating your last edited rule:

<PRE>
    ---- checked by AR till once_Adv in the BNC order
</PRE>

<LI>Make sure the resulting file compiles again.
<LI>Run <CODE>diff</CODE> on the old and the new file, just to make sure your changes look reasonable.
<LI>Submit your edits: commit them to darcs if you have access to it, send them to GF Contributions, or email them to Aarne Ranta. In the last case,
  it is enough to send those lin rules that you have processed.
<LI>Inform the gf-dev list that you have done this.
</OL>
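
<P>
The <CODE>diff</CODE> step can also be narrowed to the <CODE>lin</CODE> rules alone. The following is a minimal Python sketch, not part of the GF tooling: the function name <CODE>changed_rules</CODE> is hypothetical, and the code assumes the one-rule-per-line format used in these files.
</P>

<PRE>
```python
import difflib


def changed_rules(old_text, new_text):
    """Return the diff lines that remove ('-') or add ('+') a lin rule.

    Assumption: every rule sits on one line that starts with 'lin',
    so a line-based diff is enough to see what was edited.
    """
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    )
    return [
        line
        for line in diff
        # Keep '+'/'-' lines whose content is a lin rule; this also
        # skips the '---' and '+++' file headers of the diff.
        if line[:1] in ("+", "-") and line[1:].lstrip().startswith("lin ")
    ]
```
</PRE>

<P>
Called with the contents of the old and the new file, the function returns only the removed and added rule lines, which is usually enough to spot an accidental change.
</P>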

<P>
A reasonable batch of revisions is 500 words or more, which should be doable in less than 2 hours. To avoid conflicts and overlapping work,
don't spend more than one day on a batch of work.
</P>
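
<P>
To see how large a batch has become, the rules up to the END marker can be counted, and the first-line comment described in the steps can be derived at the same time. This is a sketch under stated assumptions: the name <CODE>progress</CODE> is hypothetical, and the marker is assumed to stand either in the tail comment of the last edited rule or on a line of its own right after it.
</P>

<PRE>
```python
import re

# Marker and header formats are taken from the steps in this document.
END_MARKER = re.compile(r"----\s*END edits by\s+(\w+)")
RULE_NAME = re.compile(r"^\s*lin\s+(\S+)\s*=")


def progress(text):
    """Return (rules_checked, header_comment), or (0, None) if no marker.

    rules_checked counts the lin rules from the top of the file up to
    and including the rule that carries the END marker.
    """
    count = 0
    last_rule = None
    for line in text.splitlines():
        rule = RULE_NAME.match(line)
        if rule:
            count += 1
            last_rule = rule.group(1)
        marker = END_MARKER.search(line)
        if marker:
            if last_rule is None:
                return 0, None  # marker found before any rule
            header = "---- checked by %s till %s in the BNC order" % (
                marker.group(1),
                last_rule,
            )
            return count, header
    return 0, None
```
</PRE>

<P>
The returned header can be pasted as the first line of the file, and the count compared with the batch size suggested above.
</P>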
<P>
The already split senses are explained <A HREF="../senses-in-Dictionary.txt">here</A>.
</P>

<H2>Guidelines</H2>

<P>
When editing a lin rule, do one of the following:
</P>

<UL>
<LI><B>accept the rule as it is</B>: remove the tail comment after the rule's terminating semicolon, if there is one. For example:

<PRE>
    lin maintain_V2 = mkV2 (mkV I.entretenir_V2) | mkV2 (mkV I.maintenir_V2) ; -- tocheck
</PRE>

   becomes

<PRE>
    lin maintain_V2 = mkV2 (mkV I.entretenir_V2) | mkV2 (mkV I.maintenir_V2) ;
</PRE>

<LI><B>change the linearization</B>: if the result is OK for you, also delete the comment. For example,

<PRE>
    lin obviously_Adv = variants{} ; --
</PRE>

  becomes

<PRE>
    lin obviously_Adv = mkAdv "évidemment" ;
</PRE>

<LI><B>suggest a split of sense</B>: add a comment prefixed by "--- split" that lists the additional senses and explains them. For example,

<PRE>
    lin labour_N = mkN "accouchement" masculine | mkN "ouvrage" masculine ; -- tocheck
</PRE>

  might become

<PRE>
    lin labour_N = mkN "travail" "travaux" masculine ; --- split mkN "accouchement" childbirth labour
</PRE>

  To check the meanings of senses that have already been split (by using numbers, e.g. <CODE>time_1_N</CODE>), look up the explanations in
  <A HREF="../Dictionary.gf">Dictionary.gf</A>.
<P></P>
<LI><B>report an anomaly</B>: change or leave the rule, and add a comment prefixed by "---- ". For example,

<PRE>
    lin back_Adv = variants{} ;
</PRE>

  might become

<PRE>
    lin back_Adv = mkAdv "en retour" ; ---- no exact translation in Fre
</PRE>

<P></P>
<LI><B>report a bad subcategory instance</B> (a common special case of anomaly):
add a comment prefixed by "---- subcat" to say that the current verb subcat instance
doesn't make sense to you. This can happen because the subcategories were partly extracted automatically. It is still good to
put a suitable verb in place. For example,

<PRE>
    lin come_VS = variants {} ;
</PRE>

  might become

<PRE>
    lin come_VS = mkVS I.venir_V ; ---- subcat
</PRE>

</UL>
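
<P>
The comment conventions above are regular enough to be read mechanically. The following Python sketch is an illustration, not an official tool: the name <CODE>classify</CODE> and the status labels are hypothetical. It treats any other tail comment (such as "-- tocheck") as not yet checked and reports a comment-free <CODE>variants{}</CODE> rule as empty, which is an assumption of this sketch.
</P>

<PRE>
```python
import re

# One rule per line: 'lin', the name, '=', the body, ';', optional comment.
# Simplification: string literals are assumed not to contain ';'.
RULE = re.compile(r"^\s*lin\s+(\S+)\s*=(.*?);\s*(--.*)?$")


def classify(line):
    """Return (rule_name, status) for a lin rule, or None for other lines."""
    match = RULE.match(line)
    if match is None:
        return None
    name = match.group(1)
    body = match.group(2).strip()
    comment = (match.group(3) or "").strip()
    if comment.startswith("---- END edits"):
        status = "accepted"   # position marker only, not a report
    elif comment.startswith("---- subcat"):
        status = "subcat"     # bad subcategory instance
    elif comment.startswith("----"):
        status = "anomaly"    # any other four-dash report
    elif comment.startswith("--- split"):
        status = "split"      # suggested split of sense
    elif comment:
        status = "unchecked"  # e.g. '-- tocheck' or a bare '--'
    elif body.replace(" ", "") == "variants{}":
        status = "empty"      # no linearization given yet
    else:
        status = "accepted"   # no tail comment left
    return name, status
```
</PRE>

<P>
Running it over every line of a file and tallying the statuses gives a quick overview of what remains to be checked.
</P>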

<P>
As general guidelines,
</P>

<UL>
<LI><B>Don't just do nothing</B>: do one of the things above for every rule, up to the point where you stop checking.
<LI><B>Maintain the format</B> of one rule per line, prefixed by <CODE>lin</CODE>.
<LI><B>Put the most standard translation as the first variant</B>, deprecated and slang words later.
<LI><B>Make sure that the morphology comes out correct</B>.
<LI><B>Make sure to return correct verb subcategorization</B>.
<LI><B>Don't split senses</B> if the difference is very small, e.g. a difference in stylistic rather than semantic value.
</UL>
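
<P>
The format guideline (one rule per line, prefixed by <CODE>lin</CODE>) can be checked before committing. This is a rough Python sketch with a hypothetical function name, <CODE>format_problems</CODE>. It assumes that "--" never occurs inside a string literal, and it only looks at lines that contain the start of a <CODE>lin</CODE> rule.
</P>

<PRE>
```python
import re

# Start of a rule: the keyword 'lin', a name, and '='.
RULE_START = re.compile(r"\blin\s+\S+\s*=")


def format_problems(text):
    """Return (line_number, message) pairs for rules that break the format."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Drop the tail comment (assumption: no '--' inside strings).
        code = line.split("--", 1)[0].strip()
        starts = len(RULE_START.findall(code))
        if starts == 0:
            continue  # header, blank line, comment or continuation line
        if starts != 1:
            problems.append((number, "more than one rule on the line"))
        elif not (RULE_START.match(code) and code.endswith(";")):
            problems.append((number, "rule is not a single 'lin ... ;' line"))
    return problems
```
</PRE>

<P>
An empty result means every rule starts with <CODE>lin</CODE> and ends with its semicolon on the same line; it says nothing about whether the morphology or the subcategorization is correct, which still needs a human check.
</P>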

<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html check-dictionary.t2t -->
</BODY></HTML>