Skip to content

Conversation

aguynamedryan
Copy link

As described in #5124, Athena requires the WITH(format='parquet') to use a string literal on the right hand side of the the format=

@aguynamedryan aguynamedryan force-pushed the athena-file-format-fix branch from 326bbf7 to 1535ac3 Compare May 28, 2025 04:07
Comment on lines -19 to +20
dialect = None
dialect: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this change? The rest of the file is not typed.

@@ -337,3 +337,22 @@ def test_parse_partitioned_by_returns_iceberg_transforms(self):
assert isinstance(parsed.this, exp.Schema)
assert next(n for n in parsed.this.expressions if isinstance(n, exp.PartitionedByBucket))
assert next(n for n in parsed.this.expressions if isinstance(n, exp.PartitionByTruncate))

def test_ctas_uses_string_for_format(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a validate_all test as well to demonstrate that Spark -> Athena is now covered?

Comment on lines +74 to +78
def _file_format_property_sql(self: Athena.Generator, e: exp.FileFormatProperty) -> str:
this = e.args.get("this")
if not this:
return "format=''"
return f"format={exp.Literal.string(this.name)}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit more verbose than necessary, let's inline it in TRANSFORMS:

            exp.FileFormatProperty: lambda self, e: f"format={self.sql(exp.Literal.string(e.name)}",

@georgesittas
Copy link
Collaborator

Thanks for the contribution @aguynamedryan, LGTM, just a few comments to clean this up a little.

@georgesittas
Copy link
Collaborator

georgesittas commented May 28, 2025

@erindru can you take a look as well? I could reproduce the issue with a non-string:

>>> from sqlmesh import Context
>>> ctx = Context()
>>> ctx.engine_adapter
<sqlmesh.core.engine_adapter.athena.AthenaEngineAdapter object at 0x121765400>
>>> ctx.engine_adapter.execute('CREATE TABLE "db"."tab" WITH (format=parquet) AS SELECT 1 AS "col"')
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: expression parquet is not a string literal

For context, last modification was made here, and before that Vaggelis had made a relevant PR here.

FWIW Trino and Presto do support string literals and we generate that syntax in Presto here, so I wonder if the override is needed now. I think if we got rid of that FileFormatProperty entry it'd just work.

@erindru
Copy link
Collaborator

erindru commented May 28, 2025

Err, how is that a non-string? You pass a string to engine_adapter.execute() which means that it isnt touched by SQLGlot at all?

My understanding is that the problem is this:

>>> parse_one("""CREATE TABLE `db`.`tab` USING parquet AS SELECT 1 AS `col`""", dialect="spark")
Create(
  ...SNIP...
      FileFormatProperty(this=Var(this=parquet))]))

exp.Var() is correct for Spark because it's not a string literal on Spark so should not have the quotes.

>>> parse_one("create table foo with(format='parquet') as select 1", dialect="athena")
Create(
  ...SNIP...
      FileFormatProperty(this=Literal(this='parquet', is_string=True))]))

This results in exp.Literal for Athena because in Athena it is a string literal and not a variable/identifier, so should have the quotes

So the issue is transpilation.

I think if we got rid of that FileFormatProperty entry it'd just work

Almost. Unfortunately this results in the property name getting converted to uppercase, but the casing matters for Athena. FORMAT='PARQUET' throws an error:

Table properties [FORMAT] are not supported.

but format='PARQUET' works fine.

In researching the right thing to do here and adjusting the tests to check my theory, I ended up implementing this: #5133

@georgesittas
Copy link
Collaborator

georgesittas commented May 28, 2025

Err, how is that a non-string? You pass a string to engine_adapter.execute() which means that it isnt touched by SQLGlot at all?

Sorry, I wasn't clear enough. I intentionally used a var in that execute call to test whether Athena would be ok with it or not, i.e., parquet instead of 'parquet'. The problem is indeed transpilation, just testing whether a var would also work for Athena.

Do Presto and Trino accept a lowercase 'parquet'? If they do, then by getting rid of upper() in f"FORMAT='{e.name.upper()}'" we could get 3 birds with 1 stone :)

@georgesittas
Copy link
Collaborator

Closing in favor of e9b3156.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants