toke.c: S_intuit_more: More effectively use S_scan_ident #23880

khwilliamson · 2025-10-28T03:04:29Z

People have avoided looking at this routine for a very long time. As a result it hasn't known until recently about lexical variables, for example. This p.r. rectifies that. The new enhancements committed today to scan_ident are now used by this function to examine potential identifiers better.

If 'strict vars' is in effect, and a variable is not found, it now assumes this is an error, or a character class was meant. This was done for the variable in effect at the time of the call to this function (external to it) recently via 3450d19. But is here extended to potential variables it encounters in its parse.

This set of changes does not require a perldelta entry.

This swaps the order of handling a conditional

The previous commit allows this one to outdent things.

Future commits will want to use this space.

Using \w here missed the possibility of lots of other syntaxes of variables, like $::foo or ${foo}, that scan_ident looks for. So call scan_ident without first filtering what it looks at. It is more convenient to then swap the code block that handles punctuation variables with the code block that handles other length 1 identifiers

Outdent, as the previous commit removed a block; and one block we indent for a future new block

I don't know why the code here uses a buffer 4 times bigger than what is the maximum an identifier can be. One might think this is because Unicode characters can be 4 bytes long, but this number has been in effect since before Perl knew about Unicode; or maybe it was to avoid a potential buffer overflow. But in any event, I'm pretty certain there is no need for that now. The buffer is now allocated to handle the maximum Unicode-needed size, and buffer overflow is checked for and avoided.

It returns more than a boolean; save the actual return for use in future commits

If scan_ident indicates that $ @ & are followed by nothing that looks like an identifier, then this isn't an expression. It has to be a character class or an error. Almost anything is an identifier when 'use utf8' isn't in effect; when it is, non ASCII has to be an Identifier Start character following these

This function was totally unaware of the possibility of these.

The code attempted to do this, but was written before lexical variables existed, and had never been updated.

happy-barney · 2025-10-28T04:47:31Z

toke.c

          case '\\':
-            if (s[1]) {
-                if (memCHRs("wds]", s[1])) {
+            if (s[1] == '\0') {


Personally: great change, maybe even add break into then branches and get rid of else if branches

Curious question: common pattern is condition if (s[1]), so why not use if (! s[1] as well

happy-barney · 2025-10-28T04:59:55Z

toke.c

-            return FALSE;
-        }
-        return TRUE;
+        return ! regcurly(s, e, NULL);


Great change. Former pattern appears to be used in other places, marking such cleanup as my next TODO once my current PR is merged.

happy-barney · 2025-10-28T05:29:52Z

toke.c

+                 * next character after it is a closing bracket, it makes it
+                 * quite likely to be that, and hence a subscript.  If it is
+                 * something else, more mildly a subscript */
+                if (/*{*/ memCHRs("])} =", tmpbuf[2]))


accidentally deleted previous comment :(

little bit confused: this happens in branch when strlen of tmpbuf is already checked to be 1, so tmpbuf[2] should \0

khwilliamson added 12 commits October 27, 2025 20:50

toke.c: S_intuit_more: Simplify a bit of code

3b86b9c

intuit_more: Handle easy thing first

b302a19

This swaps the order of handling a conditional

intuit_more: White space only

eb6d467

The previous commit allows this one to outdent things.

intuit_more: Reserve first byte of buffer

5eeec5e

Future commits will want to use this space.

intuit_more: Move declaration to first use

2aeccb9

intuit_more: White-space only

26848f0

Outdent, as the previous commit removed a block; and one block we indent for a future new block

intuit_more: Save scan_ident return

b61e5b8

It returns more than a boolean; save the actual return for use in future commits

intuit_more: Handle numeric identifiers

5e9f082

This function was totally unaware of the possibility of these.

intuit_more: Check if identifier exists

9f31304

The code attempted to do this, but was written before lexical variables existed, and had never been updated.

khwilliamson added the Use merge commit Don't merge this p.r. from github It contains multiple related commits. Instructions in perlgit label Oct 28, 2025

happy-barney reviewed Oct 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

toke.c: S_intuit_more: More effectively use S_scan_ident #23880

toke.c: S_intuit_more: More effectively use S_scan_ident #23880

khwilliamson commented Oct 28, 2025

Uh oh!

happy-barney Oct 28, 2025

Uh oh!

happy-barney Oct 28, 2025

Uh oh!

happy-barney Oct 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

toke.c: S_intuit_more: More effectively use S_scan_ident #23880

Are you sure you want to change the base?

toke.c: S_intuit_more: More effectively use S_scan_ident #23880

Conversation

khwilliamson commented Oct 28, 2025

Uh oh!

happy-barney Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

happy-barney Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

happy-barney Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants