Add Chinese error message translations #3935

MichaelChirico · 2019-10-04T16:10:43Z

Something I've been meaning to figure out for years & finally figured out. The hard part (setting up the po directory) is done -- the rest is the manual work of actually translating strings.

Before proceeding, I wanted to file this and get some feedback about whether it's worth to proceed.

I see benefits:

Extremely welcoming to the Chinese-speaking R community which is quite large and counts among its members several of our biggest contributors & advocates. Of course they (i.e. the contributors) all speak English just fine & understand the current messaging, but they may be (hopefully I'm not presuming too much here) more productive if they can get native feedback without the extra level of translation.
As near as I can tell, we would be the only of the top 100 or so R packages to offer translation to any language (none of them have the required po directory). This would make data.table a leader in accessibility and openness in yet another dimension for the R community.

Costs:

Maintenance burden -- should be updating the main files as new errors/messages are added in R/ and src/ going forward. Hopefully minimal, but it's new territory -- we'd be one of the only major packages stress testing the tools R provides here. See steps below for a bit more detail.
Package size -- We're already lapping at the 5MB CRAN limit basically every time we submit recently. I don't think the po/ directory is included in the built package but it would be 1000s of lines if so. The compressed file added to inst/ is definitely included but is only 500B as of now... for comparison, the entirety of r-source has about 6.3Mb of .mo files (sum(file.info(list.files('~/github/r-source', pattern = '.mo$', recursive = TRUE, full.names = TRUE))$size)/1e6)... but I think 5Kb is a very conservative estimate of how big our file could get even with several other translations included.
Potential trouble with new issues filed with unknown messages -- solution is to provide instructions in the Issues template for reporters on how to convert their messages back to English, but not clear how effective this would be

Any other missed considerations please file below and/or edit into this post.

cc @renkun-ken @shrektan @dracodoc -- if you're willing to put any time / effort into filling out some translations please signal so here as well 😄 If we move forward we can also post on Twitter where I know we have several more Chinese/Chinese diaspora followers.

The last consideration is -- why Chinese?

It's really about perceived utility & popularity. We can of course post a poll on Twitter before proceeding (or more objectively, does CRAN offer statistics on IPs downloading?); the other two choices would seem to be Spanish or Japanese.

For my own reference as much as anything, here are some notes/steps to getting the package to a portable state (from a package with no portability, i.e. no po directory). WRE has a few sections but I read them many times before figuring this all out (data.table can always be replaced by a generic other package below):

In root directory, run tools::update_pkg_po('.'). This will create the initial po folder and the po/R-data.table.pot file which has extracts of all arguments to stop/error/message used in the R directory.
Portability for messages in src is a lot more manual -- any message that's to be translated needs to be wrapped to the macro function _ (e.g. error("this error") is edited to error(_("this error"))), then we run xgettext --keyword=_ -o data.table.pot *.c, which will create a file analogous to R-data.table.pot for C-level messages. Some useful command line stuff for this:

# I ran this for loop to append `_`
for SRC_FILE in src/*.c;
  # no inplace -i in default mac sed
  do sed -E 's/error[(]("[^"]*")/error(_(\1)/g' $SRC_FILE > out;
  mv out $SRC_FILE;
done
# checking for other lines calling error() that didn't get _()-wrapped
grep -Er '\berror[(]' src --include=*.c | grep -v _ | grep -Ev '(?://|[*]).*error[(]'

# ditto for warning, DTWARN, DTPRINT, Rprintf, STOP, Error, mutatis mutandis
#   a bit more manual was snprintf usage
# NB that this approach didn't work perfectly, there are a few edge cases that I caught when
#   installation failed & had to fix manually:
#   * quote embedded (and escaped) within message [could be fixed with smarter regex]
#   * multi-line implicit-concat arrays (in C, `"a" "b"` is the same as `"ab"`) should be wrapped "on the outside" not individually
#   * `data.table` shares some of its `src` with `pydatatable`, so the requirement to `#include <R.h>` before the `#define _` macro meant we need to be careful about including this macro only in the R headers for these files (hence I created `po.h`)
#   * Can't use `_()` _inside_ another functional macro. Only wrap the string passed to the macro later.

# look for char array that haven't been covered yet
grep -Er '"[^"]+"' src --include=*.c | grep -Fv '_("' | grep -v "#include" | grep -v '//.*".*"'

# look for lines starting with a char array (likely continued from prev line & can be combined)
grep -Er '^\s*"' src/*.c

Create a .po file in po/ corresponding to a target language, e.g. touch po/R-zh_CN.po for Mandarin. R- means it corresponds to the R messages; see the po/ directories in e.g. r-source/src/library/{base,stats,utils,tools} for locales already in use by base, or see the R manual for a more official documentation of these codes.
Add a header to the file like so (need to edit PO-Revision-Date, Last-Translator, Language-Team, Language and the blank msgid & msgstr lines are necessary):

msgid ""
msgstr ""
"Project-Id-Version: data.table 1.12.5\n"
"POT-Creation-Date: 2019-10-04 17:06\n"
"PO-Revision-Date: 2019-10-04 17:06+08\n"
"Last-Translator: Michael Chirico <[email protected]>\n"
"Language-Team: Mandarin\n"
"Language: Mandarin\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

Run tools::update_pkg_po('.') again; this time, the captured messages and blank translations will be added to your .po file.
Add translations where the blank strings are in the .po. Each string corresponds to an argument of e.g. stop, so e.g. stop("hi", "you", "broke", "it") in R/ would create 4 entries in the .pot&.po files. This may make translation more difficult if the target language has sufficiently different grammar that the components can't be made to line up in order & make sense. I think the solution here is to use gettext/ngettext but I'm not sure.
Save & run tools::update_pkg_po('.') again. This time, the .po file will be processed and the translations will be recorded in the .mo file in inst/po.
Reinstall data.table and run R in the correct domain. This can be done ad hoc by temporarily setting the LANGUAGE env, e.g. LANGUAGE=zh_CN.UTF-8 R --no-save will start R in Chinese without editing the locale on your machine more deeply. Then trigger a few of the messages to make sure the process was successful.

NB - there was a bug in tools leading update_pkg_po to fail that was finished (as of this writing) extremely recently. If you encounter this bug, either install bleeding-edge R-devel or simply assign the corrected version of tools:::en_quote and tools::update_pkg_po to your .GlobalEnv (as well as a few more functions internal to tools) & run from console instead of using the tools version.

I think it will also be helpful to add a note to the start-up message about where how to find help from non-English to English. One way is to provide a link to the .po files about how to convert the error message to English before proceeding. Or perhaps numbering errors? I'm not sure the best way here.

working version of chinese translation

dracodoc · 2019-10-04T22:35:47Z

I would love to contribute on my favorite and most used package!
I will comment more when I got time to look at the details.

shrektan · 2019-10-05T01:26:46Z

No problem I am at your command

renkun-ken · 2019-10-05T02:16:47Z

Thanks for your research on this. I'd love to contribute!

dracodoc · 2019-10-07T21:51:47Z

About the package size, will it be possible to separate language files into separate download?

Many software have this design, by default only load 1-2 languages, then use can download language files.
If we host the language file outside of CRAN, all the languages will not cause add any burden. It's also more friendly to users that they don't need to download unneeded languages.
After the default package installation, there could be a multiple language prompt that tell user how to use a helper function to download language files.

dracodoc · 2019-10-07T21:58:28Z

I haven't get time to look at how this is implemented yet, though if the translation task do need multiple human hours, it will be ideal that the translation task can be distributed to volunteer translators that with minimal setup requirement (I'm expecting translators should at least quite familiar with R and data.table, but the steps mentioned above can be a little bit overwhelming).

Is it possible to create a table and translators can just fill in the table? One organizer can distribute the tasks to translators, standardize the frequently used words/concepts. This will make the translation part much easier.

Then there could be some script to read the table and create the needed files.

I'm not sure if this is possible or easy to implement, since I haven't look at the details.

MichaelChirico · 2019-10-08T01:42:46Z

Yep, that's basically what I have in mind! First I want to reach out on Twitter for more potential translators, just waiting to hear some first reactions from @jangorecki & @mattdowle

mattdowle · 2019-10-08T06:06:09Z

Sounds good to me! In #3937 I reduced installed size down by 1MB to about 4.0MB. You mentioned a few size figures for po, one being just 500 bytes so even at 100x that should be fine. If we had multiple languages, a user typically (by default) doesn't install all translations I assume, so even if CRAN size checks do count total size of all languages in po it's a problem we can raise with them later if and when we reach it.

jangorecki · 2019-10-08T09:00:18Z

Idea is good.
English should generally be preferred, it is much easier to google for error messages, or sometimes even understand the message in English. I recall it was annoying when I had to configure R to english rather than polish.
I think the idea is good because in some countries English has quite low adoption, even among technical users. Chinese, Russian, Spanish would be definitely useful here.

MichaelChirico · 2019-10-08T09:12:36Z

it is much easier to google for error messages

Yes, this is a sticking point. I'm not sure the best way around this. Right now, I think a good approach is to add a new start-up message that can point non-native users to the .po files to look up the English message for Googling & filing issues?

Open to other suggestions here.

renkun-ken · 2019-10-08T10:23:14Z

Another direction I think would be helpful is to provide documentation, vignettes, and error message table in multi-languages online (in data table website, github wiki, etc.). This completely does not rely on how R implements multi-lingual support and would not add package size at all, and is much easier to edit and contribute.

dracodoc · 2019-10-08T10:57:27Z

For Google search error message, is it practical to assign an error number to every message?
Sometimes same error can have different wordings among versions which also make search harder. Having a fixed error number will also help on this case.

MichaelChirico · 2019-10-13T04:10:41Z

OK here are some of my thoughts for how to proceed.

I prefer to keep pretty focused for now. As mentioned we're the first major R package to try this so we're in unexplored territory -- I don't want to venture too far outside the status quo for package dev while we figure out what are the tradeoffs of adding non-English support. So, as much as I think things like a website, wiki, etc. added in Chinese are quite useful, let's pause on that for now.

Re: adding numbered messages, I lean towards not doing it. I'm not sure how to scale that or how practical it is to do at this stage. (1) 15+ years of data.table help is available online with no numbering (2) adding numbering to every error message would make the diff of this PR much harder to sift through (3) should we be numbering R & C errors separately? (4) Suppose we came up with a numbering system for error messages today. How do we number the next error that gets added? Will we have to keep track of the current error number somewhere? What about when multiple PRs are adding different errors, does the maintainer (i.e., Matt) have to de-conflict the error numbers and guard against duplicates there too? (I have in mind this similar endeavour where duplications crept in to our manual test numbering system over the years: #3539 And that is all in a single file! Much harder to do over many scripts and between R & C code). So, I think the more practical thing to do is to add to the start-up message some instructions in native language on how to find the English version of errors/messages.

Lastly, I think that coordinating everything here on GitHub will get inconvenient/impractical very quickly. Instead, I propose to use a (through-now dormant) Slack channel to simplify things. This is for convenience/dev speed, however, I want to keep as transparent as possible with this (1) because of the importance of transparency in FOSS in general and (2) because such archive can serve quite useful to other R teams/packages considering going a similar route in the future. Therefore, all chats should happen in public channels, and at the conclusion (or maybe regularly -- if the volume of messages is large, we'll have to download occasionally given the free messaging limits in Slack), I'll post the full archive here as a textfile.

Current team of volunteers is: @renkun-ken @shrektan @dracodoc @hongyuanjia @Leo-Lee15 @amy17519 @zhiiiyang @GuangchuangYu

Please contact me at my gmail address: MichaelChirico4 and I'll invite you to the channel (rdatatable.slack).

Thanks so much everyone! It's been very encouraging to see the outpouring of support :)

jangorecki · 2019-10-13T09:36:18Z

Where is _() C function defined? would be useful to separate big changes into different commits, so changes to all error( calls separately from _() definition, even separately from pot files. Then it is easy to filter out changes to review. Do you have a minimal working example somewhere? Trying to make it work on DT from start might be much more complex than first having a minimal example. A dummy pkg having one error from R, one error from C, and translations for them.

MichaelChirico · 2019-10-14T00:52:41Z

Where is _() C function defined?

This latest commit was simply applying section 1.8.1 of WRE:

https://cran.r-project.org/doc/manuals/r-release/R-exts.html#C_002dlevel-messages

I've added this step to the instructions above. I think this makes sense as a single commit, though it does still need touch-up

jangorecki · 2019-10-15T17:46:58Z

@MichaelChirico I don't think we should be translating verbose messages, errors and warnings should be enough.

MichaelChirico · 2019-10-15T17:47:15Z

src/dogroups.c

-  // starts can now be NA (<0): if (INTEGER(starts)[0]<0 || INTEGER(lens)[0]<0) error("starts[1]<0 or lens[1]<0");
-  if (!isNull(jiscols) && LENGTH(order) && !LOGICAL(on)[0]) error("Internal error: jiscols not NULL but o__ has length"); // # nocov
-  if (!isNull(xjiscols) && LENGTH(order) && !LOGICAL(on)[0]) error("Internal error: xjiscols not NULL but o__ has length"); // # nocov
-  if(!isEnvironment(env)) error("’env’ should be an environment");


angled quotes are non-ASCII, failed xgettext

* finish the Chinese message translation for group 17 * change as KingdaShi suggests Co-authored-by: Michael Chirico <[email protected]>

* message * message * added hongyuans review Co-authored-by: Michael Chirico <[email protected]>

* done some translation! * done some translation * messsage * messsage Co-authored-by: Michael Chirico <[email protected]>

MichaelChirico · 2019-12-29T14:02:26Z

src/froll.c

 */
 void frollmeanFast(double *x, uint64_t nx, ans_t *ans, int k, double fill, bool narm, int hasna, bool verbose) {
  if (verbose)
-    snprintf(end(ans->message[0]), 500, "frollmeanFast: running for input length %"PRIu64", window %d, hasna %d, narm %d\n", (uint64_t)nx, k, hasna, (int)narm);


Treat this more as a template so the message doesn't need to be translated redundantly (see elsewhere same file as well as frolladaptive)

* covering trivial translations from among the leftovers * progress on paring down leftover translations * completed translations!!! * adjustments

MichaelChirico · 2019-12-30T01:56:30Z

R/data.table.R

        }
        tt = vapply_1i(byval,length)
-        if (any(tt!=xnrow)) stop("The items in the 'by' or 'keyby' list are length (",paste(tt,collapse=","),"). Each must be length ", xnrow, "; the same length as there are rows in x (after subsetting if i is provided).")
+        if (any(tt!=xnrow)) stop(gettextf("The items in the 'by' or 'keyby' list are length(s) (%s). Each must be length %d; the same length as there are rows in x (after subsetting if i is provided).", paste(tt, collapse=","), xnrow, domain='R-data.table'))


Converted some messages that are awkward to translate in chunks into one formatted message to be translated as one.

This should probably be done more broadly but it's pretty manual & Chinese matches grammar closely enough. If we ever support Japanese though, probably we'll need complete overhaul...

MichaelChirico · 2019-12-30T01:58:04Z

R/setkey.R

 # as opposed to DT[order(.)] where na.last=TRUE, to be consistent with base
 {
-  if (!is.data.frame(x)) stop("x must be a data.frame or data.table.")
+  if (!is.data.frame(x)) stop("x must be a data.frame or data.table")


The same message appears elsewhere without ., aligning means only one translation needed

MichaelChirico · 2019-12-30T02:40:03Z

Hey @mattdowle we are good to go! All translations completed 🎉

mattdowle · 2019-12-31T00:29:45Z

Under --as-cran there's now the following note on the overall tar.gz size :

* checking CRAN incoming feasibility ... NOTE
Size of tarball: 5129180 bytes

v1.12.8 tar.gz size was 4842680 bytes, just under the 5MB CRAN limit.

The tar.gz size is not reduced by the compression of the .Rraw files already done, because the tar.gz is already compressing everything. (The compression of the .Rraw files just keeps them compressed when the package is installed to alleviate the "installed size" notes seen previously.)

On next submission we'll see if this new tar.gz size note is just on submission, or whether CRAN checks displays it too. We could know that now if there's another package on CRAN with the same "size of tarball" note displayed by its CRAN checks page. Even if CRAN checks displays that note we'll just have to live with it. I hope the "CRAN ok" badge is still green ok if there are just notes.

mattdowle · 2019-12-31T00:33:05Z

I can't see a NEWS item. Maybe you can add one in a follow-up PR. I imagine it'll be quite a big item thanking all the translators too. It's a huge piece of work so maybe you could convey that somehow with some statistics about the number of messages translated at R and C level, how long it took, number of commits, number of files and lines touched, etc.

mattdowle · 2019-12-31T00:50:48Z

src/frolladaptive.c

  if (verbose)
-    snprintf(end(ans->message[0]), 500, "fadaptiverollmeanFast: running for input length %"PRIu64", hasna %d, narm %d\n", (uint64_t)nx, hasna, (int) narm);
+    snprintf(end(ans->message[0]), 500, _("%s: running for input length %"PRIu64", hasna %d, narm %d\n"), "fadaptiverollmeanFast", (uint64_t)nx, hasna, (int) narm);
  bool truehasna = hasna>0;                                     // flag to re-run if NAs detected


Ok understood re this change. Will be interesting if this passes Rtools ATC re PR #4073 and the discussion here: #4073 (comment)

I ran it through win-builder ATC, and yes it passes with no warnings. So this means it was something about the __fun__ and not the %s before PRIu64. I'll add a link to the previous discussion pointing here too.

awesome, glad to hear

MichaelChirico · 2019-12-31T02:39:26Z

Yep indeed I missed the NEWS item 🤦‍♂ will add a follow-up with tags for all the contributors as well as a summary.

initial exploration of adding a chinese translation of messages

f89c370

working version of chinese translation

MichaelChirico added the WIP label Oct 4, 2019

adding another .mo file that was untracked

e7208a4

trying to get the .pot for src/ files

12a169f

Michael Chirico added 8 commits October 16, 2019 00:45

adding warning()s from C

e10a714

C messages with DTPRINT or Rprintf

b987a1d

adding STOP messages

4d950a2

DTWARN messages

f64d93a

Error in forder

a30930d

odds&ends more messages

57eeace

cannot use _() with \r

9d596ac

update .pot for src files

710400a

MichaelChirico commented Oct 15, 2019

View reviewed changes

Michael Chirico added 2 commits October 16, 2019 01:54

moving pot file to po/

6ac71e4

update header metadata for src .pot

1c98859

shrektan and others added 8 commits December 28, 2019 16:00

Translation to zh_CN - Group 17 (#4049)

c9b49e5

* finish the Chinese message translation for group 17 * change as KingdaShi suggests Co-authored-by: Michael Chirico <[email protected]>

Merge branch 'master' into translation

a34c9c0

update mo, fix missing field

5094c46

Translation yc0802 (#4054)

3f8cefe

* message * message * added hongyuans review Co-authored-by: Michael Chirico <[email protected]>

adding "leftover" translations (#4143)

a9de5aa

Merge branch 'translation' into translation-kingda

4fc08a1

done some translation!-kingda (#4046)

2864ca6

* done some translation! * done some translation * messsage * messsage Co-authored-by: Michael Chirico <[email protected]>

Merge branch 'translation-kingda' into translation

67f3614

MichaelChirico commented Dec 29, 2019

View reviewed changes

Translation leftover pt2 (#4146)

b4cb616

* covering trivial translations from among the leftovers * progress on paring down leftover translations * completed translations!!! * adjustments

MichaelChirico commented Dec 30, 2019

View reviewed changes

Michael Chirico added 2 commits December 30, 2019 10:37

missing dot

8457d36

add note for proceeding to CRAN_Relase

a39acd9

MichaelChirico removed the WIP label Dec 30, 2019

MichaelChirico changed the title ~~WIP: Add Chinese error message translations~~ Add Chinese error message translations Dec 30, 2019

%f needed for > integer max

f975897

mattdowle reviewed Dec 31, 2019

View reviewed changes

mattdowle merged commit ff076a7 into master Dec 31, 2019

mattdowle deleted the translation branch December 31, 2019 01:05

mattdowle added a commit that referenced this pull request Dec 31, 2019

GLCI allow 1 NOTE for tarball size; #3935

62a7486

mattdowle mentioned this pull request Dec 31, 2019

mingw I64 warnings #4073

Merged

MichaelChirico mentioned this pull request Dec 31, 2019

add NEWS item for translation #4149

Merged

jangorecki modified the milestones: 1.12.11, 1.12.9 May 26, 2020

rikivillalba mentioned this pull request Jul 31, 2024

Add Brazilian Portuguese translation #6199

Merged

Add Chinese error message translations #3935

Add Chinese error message translations #3935

Uh oh!

Conversation

MichaelChirico commented Oct 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dracodoc commented Oct 4, 2019

Uh oh!

shrektan commented Oct 5, 2019

Uh oh!

renkun-ken commented Oct 5, 2019

Uh oh!

dracodoc commented Oct 7, 2019

Uh oh!

dracodoc commented Oct 7, 2019

Uh oh!

MichaelChirico commented Oct 8, 2019

Uh oh!

mattdowle commented Oct 8, 2019

Uh oh!

jangorecki commented Oct 8, 2019

Uh oh!

MichaelChirico commented Oct 8, 2019

Uh oh!

renkun-ken commented Oct 8, 2019

Uh oh!

dracodoc commented Oct 8, 2019

Uh oh!

MichaelChirico commented Oct 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki commented Oct 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MichaelChirico commented Oct 14, 2019

Uh oh!

jangorecki commented Oct 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MichaelChirico Oct 15, 2019

Choose a reason for hiding this comment

Uh oh!

MichaelChirico Dec 29, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MichaelChirico Dec 30, 2019

Choose a reason for hiding this comment

Uh oh!

MichaelChirico Dec 30, 2019

Choose a reason for hiding this comment

Uh oh!

MichaelChirico commented Dec 30, 2019

Uh oh!

mattdowle commented Dec 31, 2019

Uh oh!

mattdowle commented Dec 31, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mattdowle Dec 31, 2019

Choose a reason for hiding this comment

Uh oh!

mattdowle Dec 31, 2019

Choose a reason for hiding this comment

Uh oh!

MichaelChirico Dec 31, 2019

Choose a reason for hiding this comment

Uh oh!

MichaelChirico commented Dec 31, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

22 participants

MichaelChirico commented Oct 4, 2019 •

edited

Loading

MichaelChirico commented Oct 13, 2019 •

edited

Loading

jangorecki commented Oct 13, 2019 •

edited

Loading

jangorecki commented Oct 15, 2019 •

edited

Loading

MichaelChirico Dec 29, 2019 •

edited

Loading

mattdowle commented Dec 31, 2019 •

edited

Loading