Skip to content

Conversation

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Oct 4, 2019

Something I've been meaning to figure out for years & finally figured out. The hard part (setting up the po directory) is done -- the rest is the manual work of actually translating strings.

Before proceeding, I wanted to file this and get some feedback about whether it's worth to proceed.

I see benefits:

  • Extremely welcoming to the Chinese-speaking R community which is quite large and counts among its members several of our biggest contributors & advocates. Of course they (i.e. the contributors) all speak English just fine & understand the current messaging, but they may be (hopefully I'm not presuming too much here) more productive if they can get native feedback without the extra level of translation.
  • As near as I can tell, we would be the only of the top 100 or so R packages to offer translation to any language (none of them have the required po directory). This would make data.table a leader in accessibility and openness in yet another dimension for the R community.

Costs:

  • Maintenance burden -- should be updating the main files as new errors/messages are added in R/ and src/ going forward. Hopefully minimal, but it's new territory -- we'd be one of the only major packages stress testing the tools R provides here. See steps below for a bit more detail.
  • Package size -- We're already lapping at the 5MB CRAN limit basically every time we submit recently. I don't think the po/ directory is included in the built package but it would be 1000s of lines if so. The compressed file added to inst/ is definitely included but is only 500B as of now... for comparison, the entirety of r-source has about 6.3Mb of .mo files (sum(file.info(list.files('~/github/r-source', pattern = '.mo$', recursive = TRUE, full.names = TRUE))$size)/1e6)... but I think 5Kb is a very conservative estimate of how big our file could get even with several other translations included.
  • Potential trouble with new issues filed with unknown messages -- solution is to provide instructions in the Issues template for reporters on how to convert their messages back to English, but not clear how effective this would be

Any other missed considerations please file below and/or edit into this post.

cc @renkun-ken @shrektan @dracodoc -- if you're willing to put any time / effort into filling out some translations please signal so here as well 😄 If we move forward we can also post on Twitter where I know we have several more Chinese/Chinese diaspora followers.

The last consideration is -- why Chinese?

It's really about perceived utility & popularity. We can of course post a poll on Twitter before proceeding (or more objectively, does CRAN offer statistics on IPs downloading?); the other two choices would seem to be Spanish or Japanese.


For my own reference as much as anything, here are some notes/steps to getting the package to a portable state (from a package with no portability, i.e. no po directory). WRE has a few sections but I read them many times before figuring this all out (data.table can always be replaced by a generic other package below):

  1. In root directory, run tools::update_pkg_po('.'). This will create the initial po folder and the po/R-data.table.pot file which has extracts of all arguments to stop/error/message used in the R directory.
  2. Portability for messages in src is a lot more manual -- any message that's to be translated needs to be wrapped to the macro function _ (e.g. error("this error") is edited to error(_("this error"))), then we run xgettext --keyword=_ -o data.table.pot *.c, which will create a file analogous to R-data.table.pot for C-level messages. Some useful command line stuff for this:
# I ran this for loop to append `_`
for SRC_FILE in src/*.c;
  # no inplace -i in default mac sed
  do sed -E 's/error[(]("[^"]*")/error(_(\1)/g' $SRC_FILE > out;
  mv out $SRC_FILE;
done
# checking for other lines calling error() that didn't get _()-wrapped
grep -Er '\berror[(]' src --include=*.c | grep -v _ | grep -Ev '(?://|[*]).*error[(]'

# ditto for warning, DTWARN, DTPRINT, Rprintf, STOP, Error, mutatis mutandis
#   a bit more manual was snprintf usage
# NB that this approach didn't work perfectly, there are a few edge cases that I caught when
#   installation failed & had to fix manually:
#   * quote embedded (and escaped) within message [could be fixed with smarter regex]
#   * multi-line implicit-concat arrays (in C, `"a" "b"` is the same as `"ab"`) should be wrapped "on the outside" not individually
#   * `data.table` shares some of its `src` with `pydatatable`, so the requirement to `#include <R.h>` before the `#define _` macro meant we need to be careful about including this macro only in the R headers for these files (hence I created `po.h`)
#   * Can't use `_()` _inside_ another functional macro. Only wrap the string passed to the macro later.

# look for char array that haven't been covered yet
grep -Er '"[^"]+"' src --include=*.c | grep -Fv '_("' | grep -v "#include" | grep -v '//.*".*"'

# look for lines starting with a char array (likely continued from prev line & can be combined)
grep -Er '^\s*"' src/*.c
  1. Create a .po file in po/ corresponding to a target language, e.g. touch po/R-zh_CN.po for Mandarin. R- means it corresponds to the R messages; see the po/ directories in e.g. r-source/src/library/{base,stats,utils,tools} for locales already in use by base, or see the R manual for a more official documentation of these codes.
  2. Add a header to the file like so (need to edit PO-Revision-Date, Last-Translator, Language-Team, Language and the blank msgid & msgstr lines are necessary):
msgid ""
msgstr ""
"Project-Id-Version: data.table 1.12.5\n"
"POT-Creation-Date: 2019-10-04 17:06\n"
"PO-Revision-Date: 2019-10-04 17:06+08\n"
"Last-Translator: Michael Chirico <[email protected]>\n"
"Language-Team: Mandarin\n"
"Language: Mandarin\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
  1. Run tools::update_pkg_po('.') again; this time, the captured messages and blank translations will be added to your .po file.

  2. Add translations where the blank strings are in the .po. Each string corresponds to an argument of e.g. stop, so e.g. stop("hi", "you", "broke", "it") in R/ would create 4 entries in the .pot&.po files. This may make translation more difficult if the target language has sufficiently different grammar that the components can't be made to line up in order & make sense. I think the solution here is to use gettext/ngettext but I'm not sure.

  3. Save & run tools::update_pkg_po('.') again. This time, the .po file will be processed and the translations will be recorded in the .mo file in inst/po.

  4. Reinstall data.table and run R in the correct domain. This can be done ad hoc by temporarily setting the LANGUAGE env, e.g. LANGUAGE=zh_CN.UTF-8 R --no-save will start R in Chinese without editing the locale on your machine more deeply. Then trigger a few of the messages to make sure the process was successful.

NB - there was a bug in tools leading update_pkg_po to fail that was finished (as of this writing) extremely recently. If you encounter this bug, either install bleeding-edge R-devel or simply assign the corrected version of tools:::en_quote and tools::update_pkg_po to your .GlobalEnv (as well as a few more functions internal to tools) & run from console instead of using the tools version.


I think it will also be helpful to add a note to the start-up message about where how to find help from non-English to English. One way is to provide a link to the .po files about how to convert the error message to English before proceeding. Or perhaps numbering errors? I'm not sure the best way here.

@dracodoc
Copy link
Contributor

dracodoc commented Oct 4, 2019

I would love to contribute on my favorite and most used package!
I will comment more when I got time to look at the details.

@shrektan
Copy link
Member

shrektan commented Oct 5, 2019

No problem I am at your command

@renkun-ken
Copy link
Member

Thanks for your research on this. I'd love to contribute!

@dracodoc
Copy link
Contributor

dracodoc commented Oct 7, 2019

About the package size, will it be possible to separate language files into separate download?

  • Many software have this design, by default only load 1-2 languages, then use can download language files.
  • If we host the language file outside of CRAN, all the languages will not cause add any burden. It's also more friendly to users that they don't need to download unneeded languages.
  • After the default package installation, there could be a multiple language prompt that tell user how to use a helper function to download language files.

@dracodoc
Copy link
Contributor

dracodoc commented Oct 7, 2019

I haven't get time to look at how this is implemented yet, though if the translation task do need multiple human hours, it will be ideal that the translation task can be distributed to volunteer translators that with minimal setup requirement (I'm expecting translators should at least quite familiar with R and data.table, but the steps mentioned above can be a little bit overwhelming).

Is it possible to create a table and translators can just fill in the table? One organizer can distribute the tasks to translators, standardize the frequently used words/concepts. This will make the translation part much easier.

Then there could be some script to read the table and create the needed files.

I'm not sure if this is possible or easy to implement, since I haven't look at the details.

@MichaelChirico
Copy link
Member Author

Yep, that's basically what I have in mind! First I want to reach out on Twitter for more potential translators, just waiting to hear some first reactions from @jangorecki & @mattdowle

@mattdowle
Copy link
Member

Sounds good to me! In #3937 I reduced installed size down by 1MB to about 4.0MB. You mentioned a few size figures for po, one being just 500 bytes so even at 100x that should be fine. If we had multiple languages, a user typically (by default) doesn't install all translations I assume, so even if CRAN size checks do count total size of all languages in po it's a problem we can raise with them later if and when we reach it.

@jangorecki
Copy link
Member

Idea is good.
English should generally be preferred, it is much easier to google for error messages, or sometimes even understand the message in English. I recall it was annoying when I had to configure R to english rather than polish.
I think the idea is good because in some countries English has quite low adoption, even among technical users. Chinese, Russian, Spanish would be definitely useful here.

@MichaelChirico
Copy link
Member Author

it is much easier to google for error messages

Yes, this is a sticking point. I'm not sure the best way around this. Right now, I think a good approach is to add a new start-up message that can point non-native users to the .po files to look up the English message for Googling & filing issues?

Open to other suggestions here.

@renkun-ken
Copy link
Member

Another direction I think would be helpful is to provide documentation, vignettes, and error message table in multi-languages online (in data table website, github wiki, etc.). This completely does not rely on how R implements multi-lingual support and would not add package size at all, and is much easier to edit and contribute.

@dracodoc
Copy link
Contributor

dracodoc commented Oct 8, 2019

For Google search error message, is it practical to assign an error number to every message?
Sometimes same error can have different wordings among versions which also make search harder. Having a fixed error number will also help on this case.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Oct 13, 2019

OK here are some of my thoughts for how to proceed.

I prefer to keep pretty focused for now. As mentioned we're the first major R package to try this so we're in unexplored territory -- I don't want to venture too far outside the status quo for package dev while we figure out what are the tradeoffs of adding non-English support. So, as much as I think things like a website, wiki, etc. added in Chinese are quite useful, let's pause on that for now.

Re: adding numbered messages, I lean towards not doing it. I'm not sure how to scale that or how practical it is to do at this stage. (1) 15+ years of data.table help is available online with no numbering (2) adding numbering to every error message would make the diff of this PR much harder to sift through (3) should we be numbering R & C errors separately? (4) Suppose we came up with a numbering system for error messages today. How do we number the next error that gets added? Will we have to keep track of the current error number somewhere? What about when multiple PRs are adding different errors, does the maintainer (i.e., Matt) have to de-conflict the error numbers and guard against duplicates there too? (I have in mind this similar endeavour where duplications crept in to our manual test numbering system over the years: #3539 And that is all in a single file! Much harder to do over many scripts and between R & C code). So, I think the more practical thing to do is to add to the start-up message some instructions in native language on how to find the English version of errors/messages.

Lastly, I think that coordinating everything here on GitHub will get inconvenient/impractical very quickly. Instead, I propose to use a (through-now dormant) Slack channel to simplify things. This is for convenience/dev speed, however, I want to keep as transparent as possible with this (1) because of the importance of transparency in FOSS in general and (2) because such archive can serve quite useful to other R teams/packages considering going a similar route in the future. Therefore, all chats should happen in public channels, and at the conclusion (or maybe regularly -- if the volume of messages is large, we'll have to download occasionally given the free messaging limits in Slack), I'll post the full archive here as a textfile.

Current team of volunteers is: @renkun-ken @shrektan @dracodoc @hongyuanjia @Leo-Lee15 @amy17519 @zhiiiyang @GuangchuangYu

Please contact me at my gmail address: MichaelChirico4 and I'll invite you to the channel (rdatatable.slack).

Thanks so much everyone! It's been very encouraging to see the outpouring of support :)

@jangorecki
Copy link
Member

jangorecki commented Oct 13, 2019

Where is _() C function defined? would be useful to separate big changes into different commits, so changes to all error( calls separately from _() definition, even separately from pot files. Then it is easy to filter out changes to review. Do you have a minimal working example somewhere? Trying to make it work on DT from start might be much more complex than first having a minimal example. A dummy pkg having one error from R, one error from C, and translations for them.

@MichaelChirico
Copy link
Member Author

Where is _() C function defined?

This latest commit was simply applying section 1.8.1 of WRE:

https://cran.r-project.org/doc/manuals/r-release/R-exts.html#C_002dlevel-messages

I've added this step to the instructions above. I think this makes sense as a single commit, though it does still need touch-up

@jangorecki
Copy link
Member

jangorecki commented Oct 15, 2019

@MichaelChirico I don't think we should be translating verbose messages, errors and warnings should be enough.

// starts can now be NA (<0): if (INTEGER(starts)[0]<0 || INTEGER(lens)[0]<0) error("starts[1]<0 or lens[1]<0");
if (!isNull(jiscols) && LENGTH(order) && !LOGICAL(on)[0]) error("Internal error: jiscols not NULL but o__ has length"); // # nocov
if (!isNull(xjiscols) && LENGTH(order) && !LOGICAL(on)[0]) error("Internal error: xjiscols not NULL but o__ has length"); // # nocov
if(!isEnvironment(env)) error("’env should be an environment");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

angled quotes are non-ASCII, failed xgettext

shrektan and others added 8 commits December 28, 2019 16:00
* finish the Chinese message translation for group 17

* change as KingdaShi suggests

Co-authored-by: Michael Chirico <[email protected]>
* message

* message

* added hongyuans review

Co-authored-by: Michael Chirico <[email protected]>
* done some translation!

* done some translation

* messsage

* messsage

Co-authored-by: Michael Chirico <[email protected]>
*/
void frollmeanFast(double *x, uint64_t nx, ans_t *ans, int k, double fill, bool narm, int hasna, bool verbose) {
if (verbose)
snprintf(end(ans->message[0]), 500, "frollmeanFast: running for input length %"PRIu64", window %d, hasna %d, narm %d\n", (uint64_t)nx, k, hasna, (int)narm);
Copy link
Member Author

@MichaelChirico MichaelChirico Dec 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Treat this more as a template so the message doesn't need to be translated redundantly (see elsewhere same file as well as frolladaptive)

* covering trivial translations from among the leftovers

* progress on paring down leftover translations

* completed translations!!!

* adjustments
}
tt = vapply_1i(byval,length)
if (any(tt!=xnrow)) stop("The items in the 'by' or 'keyby' list are length (",paste(tt,collapse=","),"). Each must be length ", xnrow, "; the same length as there are rows in x (after subsetting if i is provided).")
if (any(tt!=xnrow)) stop(gettextf("The items in the 'by' or 'keyby' list are length(s) (%s). Each must be length %d; the same length as there are rows in x (after subsetting if i is provided).", paste(tt, collapse=","), xnrow, domain='R-data.table'))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted some messages that are awkward to translate in chunks into one formatted message to be translated as one.

This should probably be done more broadly but it's pretty manual & Chinese matches grammar closely enough. If we ever support Japanese though, probably we'll need complete overhaul...

# as opposed to DT[order(.)] where na.last=TRUE, to be consistent with base
{
if (!is.data.frame(x)) stop("x must be a data.frame or data.table.")
if (!is.data.frame(x)) stop("x must be a data.frame or data.table")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same message appears elsewhere without ., aligning means only one translation needed

@MichaelChirico MichaelChirico changed the title WIP: Add Chinese error message translations Add Chinese error message translations Dec 30, 2019
@MichaelChirico
Copy link
Member Author

Hey @mattdowle we are good to go! All translations completed 🎉

@mattdowle
Copy link
Member

Under --as-cran there's now the following note on the overall tar.gz size :

* checking CRAN incoming feasibility ... NOTE
Size of tarball: 5129180 bytes

v1.12.8 tar.gz size was 4842680 bytes, just under the 5MB CRAN limit.

The tar.gz size is not reduced by the compression of the .Rraw files already done, because the tar.gz is already compressing everything. (The compression of the .Rraw files just keeps them compressed when the package is installed to alleviate the "installed size" notes seen previously.)

On next submission we'll see if this new tar.gz size note is just on submission, or whether CRAN checks displays it too. We could know that now if there's another package on CRAN with the same "size of tarball" note displayed by its CRAN checks page. Even if CRAN checks displays that note we'll just have to live with it. I hope the "CRAN ok" badge is still green ok if there are just notes.

@mattdowle
Copy link
Member

mattdowle commented Dec 31, 2019

I can't see a NEWS item. Maybe you can add one in a follow-up PR. I imagine it'll be quite a big item thanking all the translators too. It's a huge piece of work so maybe you could convey that somehow with some statistics about the number of messages translated at R and C level, how long it took, number of commits, number of files and lines touched, etc.

if (verbose)
snprintf(end(ans->message[0]), 500, "fadaptiverollmeanFast: running for input length %"PRIu64", hasna %d, narm %d\n", (uint64_t)nx, hasna, (int) narm);
snprintf(end(ans->message[0]), 500, _("%s: running for input length %"PRIu64", hasna %d, narm %d\n"), "fadaptiverollmeanFast", (uint64_t)nx, hasna, (int) narm);
bool truehasna = hasna>0; // flag to re-run if NAs detected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok understood re this change. Will be interesting if this passes Rtools ATC re PR #4073 and the discussion here: #4073 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran it through win-builder ATC, and yes it passes with no warnings. So this means it was something about the __fun__ and not the %s before PRIu64. I'll add a link to the previous discussion pointing here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome, glad to hear

@mattdowle mattdowle merged commit ff076a7 into master Dec 31, 2019
@mattdowle mattdowle deleted the translation branch December 31, 2019 01:05
mattdowle added a commit that referenced this pull request Dec 31, 2019
@mattdowle mattdowle mentioned this pull request Dec 31, 2019
@MichaelChirico
Copy link
Member Author

Yep indeed I missed the NEWS item 🤦‍♂ will add a follow-up with tags for all the contributors as well as a summary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

translation issues/PRs related to message translation projects

Projects

None yet

Development

Successfully merging this pull request may close these issues.