If deployment is never available propagate the container msg #14835
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@           Coverage Diff           @@
##             main   #14835      +/-   ##
==========================================
- Coverage   84.75%   84.68%    -0.08%
==========================================
  Files         218      218
  Lines       13482    13502       +20
==========================================
+ Hits        11427    11434        +7
- Misses       1688     1700       +12
- Partials      367      368        +1
View full report in Codecov by Sentry.
test/e2e/image_pull_error_test.go (Outdated)
@@ -52,13 +52,10 @@ func TestImagePullError(t *testing.T) {
 	cond := r.Status.GetCondition(v1.ConfigurationConditionReady)
 	if cond != nil && !cond.IsUnknown() {
 		if cond.IsFalse() {
-			if cond.Reason == wantCfgReason {
+			if cond.Reason == wantCfgReason && strings.Contains(cond.Message, "Back-off pulling image") {
Previously the configuration would go from status:
k get configuration -n serving-tests
NAME LATESTCREATED LATESTREADY READY REASON
image-pull-error-dagmxojy image-pull-error-dagmxojy-00001 Unknown
to a failed status after the progressdeadline was exceeded (120s in tests).
Here instead, due to this patch, as soon as it sees no availability on the deployment side it marks the revision as ready=false, and the configuration will get:
{
"lastTransitionTime": "2024-01-25T13:44:18Z",
"message": "Revision \"image-pull-error-kbfvvcsg-00001\" failed with message: Deployment does not have minimum availability..",
"reason": "RevisionFailed",
"status": "False",
"type": "Ready"
}
Later on, after the progressdeadline has passed, it will get the expected message here.
That is why we will have to wait.
"conditions": [
{
"lastTransitionTime": "2024-01-25T13:36:08Z",
"message": "Revision \"image-pull-error-gnwactac-00001\" failed with message: Deployment does not have minimum availability..",
"reason": "RevisionFailed",
"status": "False",
"type": "Ready"
}
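As an aside, here is a minimal sketch of the kind of wait this implies for the e2e test. It is illustrative only: it assumes the existing v1test.WaitForConfigurationState helper and the wantCfgReason variable from the diff above, and the exact plumbing may differ from the PR.

err := v1test.WaitForConfigurationState(clients.ServingClient, names.Config,
	func(c *v1.Configuration) (bool, error) {
		cond := c.Status.GetCondition(v1.ConfigurationConditionReady)
		// Only stop waiting once Ready is False for the expected reason
		// AND the message already carries the image-pull failure.
		if cond != nil && cond.IsFalse() {
			return cond.Reason == wantCfgReason &&
				strings.Contains(cond.Message, "Back-off pulling image"), nil
		}
		return false, nil
	}, "ConfigurationHasPullError")
if err != nil {
	t.Fatal("Configuration did not surface the image pull error:", err)
}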
Need to cover "rpc error: code = NotFound desc = failed to pull and unpack image" as well, as seen from the failures.
@@ -549,6 +549,30 @@ func TestReconcile(t *testing.T) {
	Object: pa("foo", "pull-backoff", WithReachabilityUnreachable),
}},
Key: "foo/pull-backoff",
}, {
	Name: "surface ImagePullBackoff when previously scaled ok but now image is missing",
This covers the case where we are not in a rollout, e.g. scaling from zero while the image is no longer there.
The goal is that in this scenario, when we scale down to zero, we will have:
{
"lastTransitionTime": "2024-01-23T20:16:43Z",
"message": "The target is not receiving traffic.",
"reason": "NoTraffic",
"severity": "Info",
"status": "False",
"type": "Active"
},
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"status": "True",
"type": "ContainerHealthy"
},
{
"lastTransitionTime": "2024-01-23T20:05:38Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2024-01-23T20:05:38Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "ResourcesAvailable"
}
instead of
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"severity": "Info",
"status": "True",
"type": "Active"
},
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"status": "True",
"type": "ContainerHealthy"
},
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"status": "True",
"type": "Ready"
},
{
"lastTransitionTime": "2024-01-23T20:03:07Z",
"status": "True",
"type": "ResourcesAvailable"
}
so the user knows that something was not OK.
The earlier concern still applies: we don't want to mark the revision as ready=false when scaling is occurring. I haven't dug into the code changes in the PR yet, but how do we handle that scenario?
My goal is to only touch the status when the progressdeadline does not seem to work (kubernetes/kubernetes#106054). That means it only applies when
Yeah, ProgressDeadline only seems to apply when doing a rollout from one ReplicaSet to another one.
I discovered that here.
/test istio-latest-no-mesh
infra
m := revisionCondSet.Manage(rs)
avCond := m.GetCondition(RevisionConditionResourcesAvailable)

// Skip if set for other reasons
What other reasons? Why would we need to skip in that case?
I don't want to change the current state machine, see comment: #14835 (comment). So I am only targeting a specific case.
Now if, for example, the deployment faced an issue and the revision's availability condition is already set to false, I am not going to update it and set it to false again; I am just skipping the update and keeping things as they are.
In general it should not make a difference: if the availability condition being false was an intermediate state for some other reason (it happens when replicas are not ready), then it is going to be reset to true and later we will set it to false again anyway.
I can try without this in a test PR, but I am wondering about side effects in general.
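For readers of this thread, a rough sketch of the propagation plus the "skip if set for other reasons" guard being described; this is not the exact code in the PR, and it assumes the standard appsv1/corev1 types and the knative.dev/pkg/apis condition manager.

// Assumes: appsv1 "k8s.io/api/apps/v1", corev1 "k8s.io/api/core/v1".
func (rs *RevisionStatus) PropagateDeploymentAvailabilityStatusIfFalse(ds *appsv1.DeploymentStatus) {
	m := revisionCondSet.Manage(rs)
	avCond := m.GetCondition(RevisionConditionResourcesAvailable)

	// Skip if the condition is already False for some other reason: keep the
	// existing state machine untouched and only target the never-available case.
	if avCond != nil && avCond.IsFalse() {
		return
	}

	for _, cond := range ds.Conditions {
		if cond.Type == appsv1.DeploymentAvailable && cond.Status == corev1.ConditionFalse {
			m.MarkFalse(RevisionConditionResourcesAvailable, cond.Reason, cond.Message)
			return
		}
	}
}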
OK, then I think the comment was just not explaining this fully.
pkg/testing/v1/revision.go (Outdated)
@@ -137,6 +137,12 @@ func MarkInactive(reason, message string) RevisionOption {
	}
}

func MarkActiveUknown(reason, message string) RevisionOption {
typo
@@ -92,3 +90,8 @@ func createLatestConfig(t *testing.T, clients *test.Clients, names test.Resource
	c.Spec = *v1test.ConfigurationSpec(names.Image)
})
}

func hasPullErrorMsg(msg string) bool {
	return strings.Contains(msg, "Back-off pulling image") ||
where do these strings come from? How do we know it's all of them?
It is not hard to find them by trial and error; see the previous PR here: #4348.
I know it is not that great, but I don't expect it to be much of a problem.
Alternatively, we can make it independent of the string, but then I will have to wait for the configuration to fail and then also wait for the revision to have the right status. The reason I have it this way is that if I don't check for the right configuration status with the right message, the test quickly moves on to check the revision, and since there is currently no wait around the revision check it fails immediately (at the revision check).
Note that with this patch we change the initial configuration status to false, so the first configuration check in the test passes quickly.
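For reference, a sketch of the helper once it also covers the NotFound pull error mentioned earlier in the thread; only the first substring appears in the diff above, so treat the exact set of matched strings as an assumption.

func hasPullErrorMsg(msg string) bool {
	// Substrings taken from kubelet/containerd image-pull failures seen in the
	// test failures discussed above.
	return strings.Contains(msg, "Back-off pulling image") ||
		strings.Contains(msg, "failed to pull and unpack image")
}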
func (pas *PodAutoscalerStatus) MarkNotReady(reason, mes string) {
	podCondSet.Manage(pas).MarkUnknown(PodAutoscalerConditionReady, reason, mes)
}
FYI this isn't needed, because marking SKSReady=False will mark the PA Ready=False:
serving/pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go (lines 34 to 38 in ac979ec)

var podCondSet = apis.NewLivingConditionSet(
	PodAutoscalerConditionActive,
	PodAutoscalerConditionScaleTargetInitialized,
	PodAutoscalerConditionSKSReady,
)
Ok I will check it.
This is only used in tests (table_test.go) to show the expected status for the PA, but it can probably be removed.
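To illustrate the point about the living condition set (a hypothetical helper name, assuming the knative.dev/pkg/apis condition-manager API): because SKSReady is a dependent condition of podCondSet above, marking it False already flips the PA's Ready condition to False.

func markSKSNotReady(pas *PodAutoscalerStatus, reason, message string) {
	podCondSet.Manage(pas).MarkFalse(PodAutoscalerConditionSKSReady, reason, message)
	// PodAutoscalerConditionReady is now False as well, so an explicit
	// MarkNotReady/MarkUnknown call is not required.
}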
logger.Infof("marking resources unavailable with: %s: %s", w.Reason, w.Message)
rev.Status.MarkResourcesAvailableFalse(w.Reason, w.Message)
} else {
	rev.Status.PropagateDeploymentAvailabilityStatusIfFalse(&deployment.Status)
The status message on minAvailable isn't very informative. In the reported issue I'm wondering if it's better to surface ErrImagePull/ImagePullBackOff from the Pod.
I believe that's what the original code intended, but hasDeploymentTimedOut is broken because it doesn't take into account a deployment that isn't changing and is scaling from zero. It seems like that's what we should be addressing here.
I kept it generic just to show that availability was never reached. Let me check if I can get the container status and how that works.
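A hedged sketch of what "get the container status" could look like, using plain corev1 types rather than whatever accessor the PR ends up with; the function name and plumbing are illustrative.

// Assumes: corev1 "k8s.io/api/core/v1".
func imagePullError(pods []corev1.Pod) (reason, message string, found bool) {
	for _, p := range pods {
		for _, cs := range p.Status.ContainerStatuses {
			// Surface the pod-level pull failure instead of the generic
			// "Deployment does not have minimum availability." message.
			if w := cs.State.Waiting; w != nil &&
				(w.Reason == "ImagePullBackOff" || w.Reason == "ErrImagePull") {
				return w.Reason, w.Message, true
			}
		}
	}
	return "", "", false
}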
Hey @skonto, are you still working on this?
@dprotaso yes, I will take a look and update it based on your comment here: #14835 (comment).
/retest
Looks like the failures are legit in the
/hold for release tomorrow (if prow works ;) )
It needs more work.
@skonto: The following test failed.
3b7524c to 1106210 (Compare)
@dprotaso please review. This probably needs a release note.
Testing this out, it seems like regular cold starts cause the revision/config/ksvc to all become failed (ready=false) and then, once the pod is running, flip back to ready=true, which I don't think is ideal. I think what we really want is to signal when it has failed to scale, but only after the progressdeadline. Maybe that happens as a separate status (warning?) condition, unsure.
It changes the behaviour by design; as stated in the description, it starts with false. Anyway, I will check if we can have something less intrusive.
Probably we could do that at the scaler side, when activating and before we scale down to zero, by using the pod accessor and inspecting pods directly, I suspect. It should be less intrusive.
Closing in favor of #15326.
Fixes #14157
Proposed Changes
Then the ksvc, when the deployment is scaled back to zero, will have:
InitialDelaySeconds pass. Right now, when a ksvc is deployed, at the beginning of its lifecycle we have:
With this PR we start with a failing status until it is cleared: