@bianbbc87
Contributor

@bianbbc87 bianbbc87 commented Oct 16, 2025

What does this PR do / why we need it:
This PR addresses the recurring issues of Redis connection failures and DNS lookup errors observed in argocd-agent environments.

The default Argo CD Kustomize setup (install/argo-cd/*) does not define argocd-agent-agent or argocd-agent-principal as allowed sources to access argocd-redis.
In other words, the existing argocd-redis-network-policy follows the base Argo CD configuration and does not include any agent-related Pods in its ingress rules.

As a result, the agent fails to connect to Redis and repeatedly outputs the following error:

level=error msg="unable to connect to principal redis" error="dial tcp 10.96.245.6:6379: connect: connection timed out"

While the principal cluster has mitigated this issue by adding a redis-proxy (#618 (comment)), the agent cluster continues to experience the same connection failures.
To fix this, I propose updating all Kustomize variants that extend Argo CD (such as argocd-managed and argocd-autonomous) to include argocd-agent-agent as an allowed source in the Redis ingress policy.
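
For illustration, the ingress change could be sketched as a Kustomize patch like the one below. This is a minimal sketch, not the exact diff in this PR. It assumes the inherited argocd-redis-network-policy has a single ingress rule at index 0, and that the agent Pods carry the label app.kubernetes.io/name: argocd-agent-agent. The label key and value are assumptions; verify them with kubectl get pods --show-labels before relying on them.

```yaml
# kustomization.yaml fragment (illustrative sketch).
# Appends one more allowed source to the first ingress rule of the Redis policy.
# Assumption: agent Pods are labeled app.kubernetes.io/name=argocd-agent-agent.
patches:
  - target:
      kind: NetworkPolicy
      name: argocd-redis-network-policy
    patch: |-
      - op: add
        path: /spec/ingress/0/from/-
        value:
          podSelector:
            matchLabels:
              app.kubernetes.io/name: argocd-agent-agent
```

The trailing `-` in the path appends to the existing from list, so the selectors inherited from the base Argo CD manifests stay in place.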

Additionally, a more critical issue lies in the egress rules.
Because egress is currently restricted, neither the agent nor the principal pods can reach essential internal services such as argocd-server, the controller, or even coredns. This also causes DNS lookups for service names to fail.
Therefore, this PR removes the egress restriction to restore proper internal communication between components.
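
As a rough sketch of what removing the restriction can look like in Kustomize, assuming the restriction is expressed as a spec.egress list plus an Egress entry in spec.policyTypes on the same NetworkPolicy. The target name below is a hypothetical placeholder for whichever policy carries the rule, not necessarily the one changed in this PR.

```yaml
# kustomization.yaml fragment (illustrative sketch).
# Drops the egress rules and stops the policy from governing egress at all.
patches:
  - target:
      kind: NetworkPolicy
      name: example-network-policy   # hypothetical name, replace with the real policy
    patch: |-
      - op: remove
        path: /spec/egress
      - op: replace
        path: /spec/policyTypes
        value:
          - Ingress
```

Both operations matter: a policy that still lists Egress in policyTypes but has no egress rules denies all outbound traffic from the selected Pods. The remove operation also fails if /spec/egress is absent, so this patch only applies to bases that define it.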

Which issue(s) this PR fixes:
Fixes #566, #611

Related discussion: #510 (comment)
Related PR: #574

How to test changes / Special notes to the reviewer:
Assumption:
The principal cluster and agent have already successfully established a TLS connection.

Reproduction steps

  1. Check current NetworkPolicy
kubectl get networkpolicy argocd-redis-network-policy -n argocd --context <worker-cluster>
  2. Check error logs
kubectl logs -n argocd deployment/argocd-agent-agent --context <worker-cluster>

# Without ingress
redis: connection pool: failed to dial after 5 attempts: dial tcp 10.100.107.126:6379: i/o timeout

# With ingress properly configured
# Confirms successful Redis connection and normal appProject data retrieval.
level=info msg="Updating appProject" appProject=default method=UpdateAppProject module=Agent resourceVersion=1391
  3. Reproduce egress-related issue
    When egress is defined, both principal and agent clusters experience DNS lookup failures.
    This error blocks access to argocd-server, controller, repo-server, and coredns.
failed to get connection info from cluster: dial tcp: lookup argocd-redis: i/o timeout

Checklist

  • Documentation update is required by this PR (and has been updated) OR no documentation update is required.

@bianbbc87
Contributor Author

I’m looking for a way to either deploy a redis-proxy on the agent side as well, or find a better approach to allow the agent Pods to access Redis through ingress.
However, I’m not yet sure how to reflect this in the Helm chart. 👀

@jannfis
Collaborator

jannfis commented Oct 16, 2025

I’m looking for a way to either deploy a redis-proxy on the agent side as well

Hm, why would you do that? The Redis proxy is a component which is used by the Argo CD API server, so that it can access data stored on the Agent's Redis. The Redis proxy should be running on the control plane only.

@bianbbc87
Contributor Author

I’m looking for a way to either deploy a redis-proxy on the agent side as well

Hm, why would you do that? The Redis proxy is a component which is used by the Argo CD API server, so that it can access data stored on the Agent's Redis. The Redis proxy should be running on the control plane only.

In practice, the Redis proxy only runs on the control plane, while the agent only hosts its own Redis server.
In the current Argo CD setup, ingress access is blocked.

The Redis proxy is used only when the principal needs to access the agent’s Redis through the proxy.
While preparing the guide, I ran many tests and, for faster verification, initially tested with only the principal cluster set up.
Looking back, I realized that I only saw successful Redis proxy connection logs, but no actual Redis connection logs. That was my mistake.

However, in my tests, the agent couldn’t access its own argocd-redis because there was no ingress.
What I don’t understand is why this issue didn’t occur on the principal side, since it also didn’t have an ingress rule.

Does this mean that the principal wasn’t actually accessing its own Redis during execution?

@bianbbc87
Contributor Author

I’m looking for a way to either deploy a redis-proxy on the agent side as well

Regarding your earlier comment, please ignore my previous suggestion about deploying a Redis proxy on the agent side.
I had misunderstood the Redis proxy’s role. Thanks for pointing that out! 🙏

@codecov-commenter

codecov-commenter commented Oct 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.90%. Comparing base (2dd7c9f) to head (9e180dd).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #618   +/-   ##
=======================================
  Coverage   45.90%   45.90%           
=======================================
  Files          90       90           
  Lines       12103    12103           
=======================================
  Hits         5556     5556           
  Misses       6101     6101           
  Partials      446      446           


@bianbbc87 bianbbc87 force-pushed the fix/redis-network-policy branch from 7bfb654 to 790ab84 Compare October 20, 2025 14:31


Development

Successfully merging this pull request may close these issues.

Missing NetworkPolicy for accessing Redis

4 participants