If deployment is never available propagate the container msg #14835
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@           Coverage Diff           @@
##             main   #14835      +/-   ##
==========================================
- Coverage   84.75%   84.68%    -0.08%
==========================================
  Files         218      218
  Lines       13482    13502       +20
==========================================
+ Hits        11427    11434        +7
- Misses       1688     1700       +12
- Partials      367      368        +1
View full report in Codecov by Sentry.
test/e2e/image_pull_error_test.go (Outdated)
@@ -52,13 +52,10 @@ func TestImagePullError(t *testing.T) {
 	cond := r.Status.GetCondition(v1.ConfigurationConditionReady)
 	if cond != nil && !cond.IsUnknown() {
 		if cond.IsFalse() {
-			if cond.Reason == wantCfgReason {
+			if cond.Reason == wantCfgReason && strings.Contains(cond.Message, "Back-off pulling image") {
Previously the configuration would go from status:
k get configuration -n serving-tests
NAME LATESTCREATED LATESTREADY READY REASON
image-pull-error-dagmxojy image-pull-error-dagmxojy-00001 Unknown
to a failed status after the progressdeadline was exceeded (120s in tests).
Here instead, due to this patch, as soon as it sees no availability on the deployment side it marks the revision as ready=false, and the configuration will get:
{
"lastTransitionTime": "2024-01-25T13:44:18Z",
"message": "Revision \"image-pull-error-kbfvvcsg-00001\" failed with message: Deployment does not have minimum availability..",
"reason": "RevisionFailed",
"status": "False",
"type": "Ready"
}
Later on, after the progressdeadline has passed, it will get the expected message here.
That is why we will have to wait.
"conditions": [
{
"lastTransitionTime": "2024-01-25T13:36:08Z",
"message": "Revision \"image-pull-error-gnwactac-00001\" failed with message: Deployment does not have minimum availability..",
"reason": "RevisionFailed",
"status": "False",
"type": "Ready"
}
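As an aside, here is a minimal sketch of the kind of wait this implies for the e2e test. It is illustrative only: it assumes the existing v1test.WaitForConfigurationState helper and the wantCfgReason variable from the diff above, and the exact plumbing may differ from the PR.

err := v1test.WaitForConfigurationState(clients.ServingClient, names.Config,
	func(c *v1.Configuration) (bool, error) {
		cond := c.Status.GetCondition(v1.ConfigurationConditionReady)
		// Only stop waiting once Ready is False for the expected reason
		// AND the message already carries the image-pull failure.
		if cond != nil && cond.IsFalse() {
			return cond.Reason == wantCfgReason &&
				strings.Contains(cond.Message, "Back-off pulling image"), nil
		}
		return false, nil
	}, "ConfigurationHasPullError")
if err != nil {
	t.Fatal("Configuration did not surface the image pull error:", err)
}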
Need to cover "rpc error: code = NotFound desc = failed to pull and unpack image" as well, as seen from the failures.
@@ -549,6 +549,30 @@ func TestReconcile(t *testing.T) {
	Object: pa("foo", "pull-backoff", WithReachabilityUnreachable),
}},
Key: "foo/pull-backoff",
}, {
	Name: "surface ImagePullBackoff when previously scaled ok but now image is missing",
This covers the case where we are not in a rollout, e.g. scaling from zero while the image is no longer there.
The goal is that in this scenario, when we scale down to zero, we will have:
{
"lastTransitionTime": "2024-01-23T20:16:43Z",
"message": "The target is not receiving traffic.",
"reason": "NoTraffic",
"severity": "Info",
"status": "False",
"type": "Active"
},
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"status": "True",
"type": "ContainerHealthy"
},
{
"lastTransitionTime": "2024-01-23T20:05:38Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2024-01-23T20:05:38Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "ResourcesAvailable"
}
instead of
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"severity": "Info",
"status": "True",
"type": "Active"
},
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"status": "True",
"type": "ContainerHealthy"
},
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"status": "True",
"type": "Ready"
},
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"status": "True",
"type": "ResourcesAvailable"
}
so the user knows that something was not OK.
The earlier concern still applies: we don't want to mark the revision as ready=false when scaling is occurring. I haven't dug into the code changes in the PR yet, but how do we handle that scenario?
My goal is to only touch the status when the progressdeadline does not seem to work (kubernetes/kubernetes#106054). That means it only applies when
Yeah, ProgressDeadline only seems to apply when doing a rollout from one ReplicaSet to another one.
I discovered that here.
/test istio-latest-no-mesh
infra
m := revisionCondSet.Manage(rs)
avCond := m.GetCondition(RevisionConditionResourcesAvailable)

// Skip if set for other reasons
What other reasons? Why would we need to skip in that case?
I don't want to change the current state machine, see comment: #14835 (comment). So I am only targeting a specific case.
Now if, for example, the deployment faced an issue and the revision's availability condition is already set to false, I am not going to update it and set it to false again; I am just skipping the update and keeping things as they are.
In general it should not make a difference: if the availability condition being false was an intermediate state for some other reason (it happens when replicas are not ready), then it is going to be reset to true and later we will set it to false again anyway.
I can try without this in a test PR, but I am wondering about side effects in general.
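For readers of this thread, a rough sketch of the propagation plus the "skip if set for other reasons" guard being described; this is not the exact code in the PR, and it assumes the standard appsv1/corev1 types and the knative.dev/pkg/apis condition manager.

// Assumes: appsv1 "k8s.io/api/apps/v1", corev1 "k8s.io/api/core/v1".
func (rs *RevisionStatus) PropagateDeploymentAvailabilityStatusIfFalse(ds *appsv1.DeploymentStatus) {
	m := revisionCondSet.Manage(rs)
	avCond := m.GetCondition(RevisionConditionResourcesAvailable)

	// Skip if the condition is already False for some other reason: keep the
	// existing state machine untouched and only target the never-available case.
	if avCond != nil && avCond.IsFalse() {
		return
	}

	for _, cond := range ds.Conditions {
		if cond.Type == appsv1.DeploymentAvailable && cond.Status == corev1.ConditionFalse {
			m.MarkFalse(RevisionConditionResourcesAvailable, cond.Reason, cond.Message)
			return
		}
	}
}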
OK, then I think the comment was just not explaining this fully.
pkg/testing/v1/revision.go (Outdated)
@@ -137,6 +137,12 @@ func MarkInactive(reason, message string) RevisionOption {
	}
}

func MarkActiveUknown(reason, message string) RevisionOption {
typo
@@ -92,3 +90,8 @@ func createLatestConfig(t *testing.T, clients *test.Clients, names test.Resource
	c.Spec = *v1test.ConfigurationSpec(names.Image)
})
}

func hasPullErrorMsg(msg string) bool {
	return strings.Contains(msg, "Back-off pulling image") ||
where do these strings come from? How do we know it's all of them?
It is not hard to find them by trial and error; see the previous PR here: #4348.
I know it is not that great, but I don't expect it to be much of a problem.
Alternatively, we can make it independent of the string, but then I will have to wait for the configuration to fail and then also wait for the revision to have the right status. The reason I have it this way is that if I don't check for the right configuration status with the right message, the test quickly moves on to check the revision, and since there is currently no wait around the revision check it fails immediately (at the revision check).
Note that with this patch we change the initial configuration status to false, so the first configuration check in the test passes quickly.
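For reference, a sketch of the helper once it also covers the NotFound pull error mentioned earlier in the thread; only the first substring appears in the diff above, so treat the exact set of matched strings as an assumption.

func hasPullErrorMsg(msg string) bool {
	// Substrings taken from kubelet/containerd image-pull failures seen in the
	// test failures discussed above.
	return strings.Contains(msg, "Back-off pulling image") ||
		strings.Contains(msg, "failed to pull and unpack image")
}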
func (pas *PodAutoscalerStatus) MarkNotReady(reason, mes string) {
	podCondSet.Manage(pas).MarkUnknown(PodAutoscalerConditionReady, reason, mes)
}
FYI this isn't needed, because marking SKSReady=False will mark the PA Ready=False:
serving/pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go (lines 34 to 38 in ac979ec)

var podCondSet = apis.NewLivingConditionSet(
	PodAutoscalerConditionActive,
	PodAutoscalerConditionScaleTargetInitialized,
	PodAutoscalerConditionSKSReady,
)
Ok I will check it.
This is only used in tests (table_test.go) to show the expected status for the PA, but it can probably be removed.
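To illustrate the point about the living condition set (a hypothetical helper name, assuming the knative.dev/pkg/apis condition-manager API): because SKSReady is a dependent condition of podCondSet above, marking it False already flips the PA's Ready condition to False.

func markSKSNotReady(pas *PodAutoscalerStatus, reason, message string) {
	podCondSet.Manage(pas).MarkFalse(PodAutoscalerConditionSKSReady, reason, message)
	// PodAutoscalerConditionReady is now False as well, so an explicit
	// MarkNotReady/MarkUnknown call is not required.
}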
logger.Infof("marking resources unavailable with: %s: %s", w.Reason, w.Message)
rev.Status.MarkResourcesAvailableFalse(w.Reason, w.Message)
} else {
	rev.Status.PropagateDeploymentAvailabilityStatusIfFalse(&deployment.Status)
The status message on minAvailable isn't very informative. In the reported issue I'm wondering if it's better to surface ErrImagePull/ImagePullBackOff from the Pod.
I believe that's what the original code intended, but hasDeploymentTimedOut is broken because it doesn't take into account a deployment that isn't changing and is scaling from zero. It seems like that's what we should be addressing here.
I kept it generic just to show that availability was never reached. Let me check if I can get the container status and how that works.
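A hedged sketch of what "get the container status" could look like, using plain corev1 types rather than whatever accessor the PR ends up with; the function name and plumbing are illustrative.

// Assumes: corev1 "k8s.io/api/core/v1".
func imagePullError(pods []corev1.Pod) (reason, message string, found bool) {
	for _, p := range pods {
		for _, cs := range p.Status.ContainerStatuses {
			// Surface the pod-level pull failure instead of the generic
			// "Deployment does not have minimum availability." message.
			if w := cs.State.Waiting; w != nil &&
				(w.Reason == "ImagePullBackOff" || w.Reason == "ErrImagePull") {
				return w.Reason, w.Message, true
			}
		}
	}
	return "", "", false
}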
Hey @skonto, are you still working on this?
@dprotaso yes, I will take a look and update it based on your comment here: #14835 (comment).
/retest
Looks like the failures are legit in the
/hold for release tomorrow (if prow works ;) )
It needs more work.
@skonto: The following test failed.
3b7524c to 1106210 (Compare)
@dprotaso please review. This probably needs a release note.
Testing this out, it seems like regular cold starts cause the revision/config/ksvc to all become failed (ready=false) and then, once the pod is running, flip back to ready=true, which I don't think is ideal. I think what we really want is to signal when it has failed to scale, but only after the progressdeadline. Maybe that happens as a separate status (warning?) condition, unsure.
It changes the behaviour by design; as stated in the description, it starts with false. Anyway, I will check if we can have something less intrusive.
Probably we could do that at the scaler side, when activating and before we scale down to zero, by using the pod accessor and inspecting pods directly, I suspect. It should be less intrusive.
Closing in favor of #15326.
Fixes #14157
Proposed Changes
Then the ksvc, when the deployment is scaled back to zero, will have:
InitialDelaySeconds pass. Right now, when a ksvc is deployed, at the beginning of its lifecycle we have:
With this PR we start with a failing status until it is cleared: