Skip to content

Conversation

@treblereel
Copy link
Collaborator

Many thanks for submitting your Pull Request ❤️!

What this PR does / why we need it:

Special notes for reviewers:

Additional information (if needed):

@treblereel treblereel requested a review from fjtirado as a code owner November 3, 2025 06:28
@treblereel treblereel self-assigned this Nov 3, 2025
@treblereel treblereel added the java Pull requests that update java code label Nov 3, 2025
@treblereel treblereel force-pushed the ContainerTask branch 2 times, most recently from 4384a68 to a8f5dd5 Compare November 4, 2025 02:36
@fjtirado fjtirado self-requested a review November 5, 2025 13:31
Copy link
Collaborator

@fjtirado fjtirado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested changes are reflected in this PR treblereel#3

Copy link
Collaborator

@fjtirado fjtirado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@ricardozanini ricardozanini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are small comments. I was going to write on every line, but I decided to pair my analysis with AI. Report generated with assistance:


High-impact issues (bugs / logic gaps)

  1. Timeout path returns success (0) instead of failing
    In waitAccordingToLifetime, when policy == EVENTUALLY and awaitStatusCode(timeout) times out (or throws), you safeStop(id) and then fall through to the method’s final return 0;. That silently treats timeouts as success.
  • Fix: detect timeout and throw a clear TimeoutException (or your domain error) with the configured timeout and container name/id. Only return 0 on actual exit code 0.
  1. policy may be null → NPE risk
    If the workflow’s ContainerLifetime is absent, policy is never set. You later compare if (policy == ContainerCleanupPolicy.EVENTUALLY) ….
  • Fix: set a default policy in the builder (e.g., IMMEDIATELY), or guard against null.
  1. Misleading exception messages (“exit code”) for non-exit errors
    Catching InterruptedException / IOException and throwing “failed with exit code …” is confusing.
  • Fix: use accurate messages: “interrupted while executing container …”, “I/O error while …”, etc. Keep exit code language only for real exit codes.
  1. Always pulling images = slow & noisy
    pullImageIfNeeded always pulls. That’s expensive and brittle offline.
  • Fix: inspectImageCmd(image) first; only pull on NotFoundException. Consider a flag to “always pull” vs “if not present”.
  1. Tags/digests handling
    NameParser.parseRepositoryTag + defaulting to latest won’t handle digests (@sha256:…) correctly and may fetch unexpected tags.
  • Fix: detect digests and call pullImageCmd(repo).withTag(tag) only when a tag exists; for digests use withRepository + withTag appropriately or avoid re-tagging entirely.
  1. Timeout result handling
    awaitStatusCode(timeout) returns Integer (nullable). You don’t check null before using it.
  • Fix: treat null as timeout and fail (don’t return 0).
  1. Auto-remove mismatch
    Comment says “must be removed because of withAutoRemove(true)” but I don’t see you setting withAutoRemove(true) here (maybe a property setter does it). If it’s not guaranteed, you might orphan containers.
  • Fix: ensure HostConfig.withAutoRemove(true) is always set for short-lived tasks, or make the cleanup policy explicit.

Robustness & operability

  1. Logs on failure
    When a container exits non-zero, you throw a mapped error but don’t surface container logs—which makes debugging painful.
  • Recommendation: on non-zero (or timeout), stream/capture the last N KB of stdout/stderr (logContainerCmd(id).withStdOut(true).withStdErr(true).withTail(n)), and include a trimmed snippet in the exception (and full logs in your engine’s diagnostics).
  1. Resource limits & safety rails
    No CPU/memory/FS caps → a user container can hog the host.
  • Recommendation: support resource settings (CPU quota/period, cpus, memory/memorySwap, pidsLimit, read-only rootfs, drop capabilities, seccomp/apparmor profiles). Default to conservative limits; allow workflow overrides in a controlled/allow-listed way.
  • This one we might have to adjust on upstream specification
  1. Image allow-list / registry auth
    Consider an allow-list of images/prefixes and registry credentials. Otherwise, anyone can run :latest from anywhere.
  • Recommendation: resolve images through a policy component; support private registry auth via AuthConfig.
  • We might have to add a config to the engine core such as sw.impl.containers.policy - @fjtirado
  1. Networking defaults
    Decide your default network mode (none/bridge/custom). Workflows often don’t need internet access.
  • Recommendation: default to a locked-down network, only open what the task needs.
  1. Graceful stop semantics
    safeStop(id, 10s) is good, but for stubborn containers a kill fallback may be needed if stop doesn’t terminate.
  • Recommendation: if still running after .stop() + wait, send kill (SIGKILL) with a short deadline before remove.
  1. Races around isRunning / remove
    Between inspect and stop/remove the state may change; your try/catch(ignore) masks real issues—but that’s fine for cleanup. Consider logging at debug so post-mortems are possible.

  2. Singleton Docker client lifecycle
    DockerClientHolder is static and never closed.

  • Recommendation: register a shutdown hook / application lifecycle hook to close the HTTP client; or rely on process lifetime but document it.
  1. Threading & pool usage
    CompletableFuture.supplyAsync(..., executorService()) is correct, but pulls/waits can be long. Ensure your executor is sized for this (don’t starve other workflow tasks). Consider a dedicated pool for container jobs.

  2. Return value & output contract
    startSync returns the same input if exit is 0. If containers are expected to produce artifacts/logs/files, make that contract explicit (e.g., volumes or captured stdout). If stdout should become the task output, wire it.

  3. Port/volume validation
    Your property setters likely map ports/volumes directly. Validate host paths (no / escape tricks), read-only when possible, and sanitize port ranges.

  4. Name & label hygiene
    Add consistent labels (workflow id, task id, run id) and a predictable Name. This helps you find & clean orphans if the engine crashes.

e.g. labels: sw.workflow=, sw.instance=, sw.task=, sw.run=.

  1. Image “latest” pitfalls
    Defaulting to latest is reproducibility-hostile.
  • Recommendation: require explicit tags in production; optionally enforce via validation.
  1. Map more exit codes carefully
    Your map is fine, but consider surfacing exit status + signal from inspect to avoid mislabeling (e.g., 137 might be OOMKill vs SIGKILL). You can fetch inspectContainerCmd(id).getState().getOOMKilled() and include that.

I reviewed and changed every item on this report.

assertTrue(ports.containsKey(ExposedPort.tcp(8881)));
assertTrue(ports.containsKey(ExposedPort.tcp(8882)));

dockerClient.removeContainerCmd(containerId).withForce(true).exec();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this to a finally block?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same applies to all tests; the procedure remains the same.

- runContainer:
run:
container:
image: alpine:latest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use busybox instead: https://hub.docker.com/_/busybox

It's less than 1MB.

@fjtirado
Copy link
Collaborator

fjtirado commented Nov 7, 2025

Some comments on the IA report

Timeout result handling
awaitStatusCode(timeout) returns Integer (nullable). You don’t check null before using it.
Fix: treat null as timeout and fail (don’t return 0).

IA is not perfect ;), I already grasped this one, the Integer (null or not) is passed to the upper methods, which checks for null in startSync method

Threading & pool usage
CompletableFuture.supplyAsync(..., executorService()) is correct, but pulls/waits can be long. Ensure your executor is sized for this (don’t starve other workflow tasks). Consider a dedicated pool for container jobs.

hmmm, picky IA, the answer is not, we are not going to create specific executors for potentially long tasks.

Return value & output contract
startSync returns the same input if exit is 0. If containers are expected to produce artifacts/logs/files, make that contract explicit (e.g., volumes or captured stdout). If stdout should become the task output, wire it.

Good point that I believe we discussed briefly offline. I do not see it included here

@fjtirado
Copy link
Collaborator

fjtirado commented Nov 7, 2025

policy may be null → NPE risk
If the workflow’s ContainerLifetime is absent, policy is never set. You later compare if (policy == ContainerCleanupPolicy.EVENTUALLY) ….
Fix: set a default policy in the builder (e.g., IMMEDIATELY), or guard against null.

This in intentional, the == comparison is not going to cause a NPE

@fjtirado
Copy link
Collaborator

fjtirado commented Nov 7, 2025

Image allow-list / registry auth
Consider an allow-list of images/prefixes and registry credentials. Otherwise, anyone can run :latest from anywhere.
Recommendation: resolve images through a policy component; support private registry auth via AuthConfig.
We might have to add a config to the engine core such as sw.impl.containers.policy - @fjtirado

I do not exactly see what the issue is, but in any case we already have a configuration object .
If we want (because the IA mention it ;)) to define which images can be run on advance, we can add a property with the list of images allowed, but I do not see a real reason to do that and I do not see this affecting the core engine ;)

@fjtirado
Copy link
Collaborator

fjtirado commented Nov 7, 2025

Graceful stop semantics
safeStop(id, 10s) is good, but for stubborn containers a kill fallback may be needed if stop doesn’t terminate.

I do not think hardcoded timeouts are ok, this should be read from config. Also, probably the stop container can be run asynchronously, but this we can leave to a follow-up PR,. As first version is more than fine

@fjtirado
Copy link
Collaborator

fjtirado commented Nov 7, 2025

Image “latest” pitfalls
Defaulting to latest is reproducibility-hostile.
Recommendation: require explicit tags in production; optionally enforce via validation.

We are defaulting to latest only if user has not set anything. I do not think the alternative, force him to include the version in the image name or throw exception, is more desirable than the current approach. As we usually said, we should prioritize user experience.

@fjtirado
Copy link
Collaborator

fjtirado commented Nov 7, 2025

Singleton Docker client lifecycle
DockerClientHolder is static and never closed.

This is an intentional design choice. This is lazily loaded by the JVM the first time is used and keep alive till the process is done (since the WorkflowApplication should be tied with the proces lifecycle, it should be fine)
But.... we probably should allow users to be able to change the default client, like we are doing now with the Http client, take a look to https://github.com/serverlessworkflow/sdk-java/blob/main/impl/http/src/main/java/io/serverlessworkflow/impl/executors/http/HttpClientResolver.java#L32-L37 and WorkflowApplication.withAdditionalObject

@fjtirado
Copy link
Collaborator

fjtirado commented Nov 7, 2025

Comment says “must be removed because of withAutoRemove(true)” but I don’t see you setting withAutoRemove(true) here (maybe a property setter does it). If it’s not guaranteed, you might orphan containers.

Yes, a property setter is setting it when policy is ALWAYS or NEVER

@ricardozanini
Copy link
Member

We are defaulting to latest only if user has not set anything. I do not think the alternative, force him to include the version in the image name or throw exception, is more desirable than the current approach. As we usually said, we should prioritize user experience.

I agree, we can keep using latest, although it is not recommended. However, everyone defaults to latest when emptying.

@ricardozanini
Copy link
Member

This in intentional, the == comparison is not going to cause a NPE

But the point is to set a default, I think it's a fair game.

@treblereel treblereel force-pushed the ContainerTask branch 2 times, most recently from 03d97ac to 0728b37 Compare November 7, 2025 20:21
fjtirado and others added 6 commits November 7, 2025 12:21
Signed-off-by: Dmitrii Tikhomirov <[email protected]>

Signed-off-by: Dmitrii Tikhomirov <[email protected]>
Signed-off-by: Dmitrii Tikhomirov <[email protected]>
Signed-off-by: Dmitrii Tikhomirov <[email protected]>
Signed-off-by: fjtirado <[email protected]>
Signed-off-by: Dmitrii Tikhomirov <[email protected]>
fjtirado and others added 3 commits November 7, 2025 12:21
Signed-off-by: fjtirado <[email protected]>
Signed-off-by: Dmitrii Tikhomirov <[email protected]>
Co-authored-by: Ricardo Zanini <[email protected]>
Signed-off-by: Dmitrii Tikhomirov <[email protected]>
Signed-off-by: Dmitrii Tikhomirov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

java Pull requests that update java code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants