
If deployment is never available propagate the container msg #14835

Closed
skonto wants to merge 9 commits into main from propagate_rev_not_available

Conversation

@skonto (Contributor) commented Jan 24, 2024:

Fixes #14157

Proposed Changes

  • This propagates the container's message when the deployment never reaches availability and keeps reporting:
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "Ready"
        }

Then, when the deployment is scaled back to zero, the ksvc will have:


           "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T09:18:31Z",
                        "message": "Revision \"helloworld-go-00001\" failed with message: Back-off pulling image \"index.docker.io/skonto/helloworld-go@sha256:dd20d7659c16bdc58c09740a543ef3c36b7c04742a2b6b280a30c2a76dcf6c09\".",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T09:14:44Z",
                        "message": "Revision \"helloworld-go-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T09:14:44Z",
                        "message": "Revision \"helloworld-go-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
                ],


  • This changes the initial state from Unknown to failed even for normal ksvcs; however, once the revision is up, this is cleared at the configuration level as well. The idea for this initial state comes from the fact that Kubernetes also considers a readiness probe failed until initialDelaySeconds has passed (a probe sketch follows the status examples below).
    Right now, when a ksvc is deployed, at the beginning of its lifecycle we have:
                "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "status": "Unknown",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
                        "reason": "RevisionMissing",
                        "status": "Unknown",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:34:55Z",
                        "message": "Configuration \"helloworld-go\" is waiting for a Revision to become ready.",
                        "reason": "RevisionMissing",
                        "status": "Unknown",
                        "type": "RoutesReady"
                    }

With this PR we start with a failing status until it is cleared:

                "conditions": [
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Revision \"helloworld-go-00001\" failed with message: .",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Configuration \"helloworld-go\" does not have any ready Revision.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-04-02T10:31:01Z",
                        "message": "Configuration \"helloworld-go\" does not have any ready Revision.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
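
For reference, a minimal sketch (illustrative only, not part of this PR) of the readiness-probe behaviour referenced above; the path, port, and timing values are arbitrary:

    package example

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    // Illustrative only: Kubernetes does not evaluate a readiness probe before
    // InitialDelaySeconds has elapsed, so the container starts out not-ready,
    // the same idea as starting the ksvc conditions in a failing state.
    var readinessProbe = &corev1.Probe{
        ProbeHandler: corev1.ProbeHandler{
            HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
        },
        InitialDelaySeconds: 5, // not ready for at least the first 5 seconds
        PeriodSeconds:       2,
    }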
  • To reproduce the initial issue with minikube you can do the following:
    • apply a ksvc
    • wait for a pod to come up and then for the ksvc to scale to zero
    • run minikube image list and then minikube image rm so that no image is available within minikube for the user container
    • block any internet access so the image can't be pulled
    • issue a request via curl ...
    • observe the revision and deployment statuses
  • When the issue is resolved the next request will clear the status messages:
               "conditions": [
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "ConfigurationsReady"
                   },
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "Ready"
                   },
                   {
                       "lastTransitionTime": "2024-04-02T09:37:03Z",
                       "status": "True",
                       "type": "RoutesReady"
                   }

@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2024
@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/API API objects and controllers area/autoscale area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Jan 24, 2024
@skonto skonto removed the request for review from evankanderson January 24, 2024 14:41
codecov bot commented Jan 24, 2024:

Codecov Report

Attention: Patch coverage is 48.00000% with 13 lines in your changes missing coverage. Please review.

Project coverage is 84.68%. Comparing base (bb51203) to head (fababbe).

Files                                            Patch %   Lines
pkg/apis/serving/v1/revision_lifecycle.go        0.00%     11 Missing ⚠️
pkg/reconciler/revision/reconcile_resources.go   81.81%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14835      +/-   ##
==========================================
- Coverage   84.75%   84.68%   -0.08%     
==========================================
  Files         218      218              
  Lines       13482    13502      +20     
==========================================
+ Hits        11427    11434       +7     
- Misses       1688     1700      +12     
- Partials      367      368       +1     


@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 25, 2024
@@ -52,13 +52,10 @@ func TestImagePullError(t *testing.T) {
 	cond := r.Status.GetCondition(v1.ConfigurationConditionReady)
 	if cond != nil && !cond.IsUnknown() {
 		if cond.IsFalse() {
-			if cond.Reason == wantCfgReason {
+			if cond.Reason == wantCfgReason && strings.Contains(cond.Message, "Back-off pulling image") {
@skonto (Contributor, Author) commented Jan 25, 2024:

Previously the configuration would go from this status:

k get configuration  -n serving-tests 
NAME                        LATESTCREATED                     LATESTREADY   READY     REASON
image-pull-error-dagmxojy   image-pull-error-dagmxojy-00001                 Unknown 

to a failed status after the progress deadline was exceeded (120s in tests).
Here instead, with this patch, as soon as it sees no availability on the deployment side, it will mark the revision as ready=false, and the configuration will get:

                    {
                        "lastTransitionTime": "2024-01-25T13:44:18Z",
                        "message": "Revision \"image-pull-error-kbfvvcsg-00001\" failed with message: Deployment does not have minimum availability..",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "Ready"
                    }

Later on, after the progress deadline has passed, it will get the expected message here.
That is why we will have to wait.

                "conditions": [
                    {
                        "lastTransitionTime": "2024-01-25T13:36:08Z",
                        "message": "Revision \"image-pull-error-gnwactac-00001\" failed with message: Deployment does not have minimum availability..",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "Ready"
                    }

@skonto (Contributor, Author):

Need to cover "rpc error: code = NotFound desc = failed to pull and unpack image" as well, as seen from the failures.

@@ -549,6 +549,30 @@ func TestReconcile(t *testing.T) {
 		Object: pa("foo", "pull-backoff", WithReachabilityUnreachable),
 	}},
 	Key: "foo/pull-backoff",
+}, {
+	Name: "surface ImagePullBackoff when previously scaled ok but now image is missing",
@skonto (Contributor, Author) commented Jan 25, 2024:

This covers the case where we are not in a rollout, e.g. scaling from zero when the image is not there.
The goal is that in this scenario, when we scale down to zero, we will have:

        {
            "lastTransitionTime": "2024-01-23T20:16:43Z",
            "message": "The target is not receiving traffic.",
            "reason": "NoTraffic",
            "severity": "Info",
            "status": "False",
            "type": "Active"
        },
        {
            "lastTransitionTime": "2024-01-23T20:03:07Z",
            "status": "True",
            "type": "ContainerHealthy"
        },
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "Ready"
        },
        {
            "lastTransitionTime": "2024-01-23T20:05:38Z",
            "message": "Deployment does not have minimum availability.",
            "reason": "MinimumReplicasUnavailable",
            "status": "False",
            "type": "ResourcesAvailable"
        }

instead of

         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "severity": "Info",
             "status": "True",
             "type": "Active"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "ContainerHealthy"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "Ready"
         },
         {
             "lastTransitionTime": "2024-01-23T20:03:07Z",
             "status": "True",
             "type": "ResourcesAvailable"
         }

so the user knows that something was not OK.

@dprotaso (Member) commented:

The earlier concern with using Available is that it changes while the revision is scaling: Available=False when scaling up, until all the pods are ready.

We don't want to mark the revision as ready=false while scaling is occurring. I haven't dug into the code changes in the PR yet, but how do we handle that scenario?

@skonto (Contributor, Author) commented Jan 25, 2024:

We don't want to mark the revision as ready=false when scaling is occurring.

My goal is to touch the status only when the progress deadline does not seem to work (kubernetes/kubernetes#106054). That means it applies only when *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 holds and some waiting is happening, which means that in that case nothing is actually ready. Also, I only set the revision availability to false if it is not false already; otherwise I don't touch it.
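
A minimal sketch of that guard, with a hypothetical helper name (the PR's actual code may differ):

    package example

    import appsv1 "k8s.io/api/apps/v1"

    // deploymentStuckUnavailable (hypothetical name) reports whether the
    // deployment wants replicas but none have become available -- the only
    // situation this patch acts on.
    func deploymentStuckUnavailable(d *appsv1.Deployment) bool {
        return d.Spec.Replicas != nil &&
            *d.Spec.Replicas > 0 &&
            d.Status.AvailableReplicas == 0
    }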

@dprotaso (Member) commented:

Yeah, ProgressDeadline only seems to apply when doing a rollout from one ReplicaSet to another.

@dprotaso (Member) commented:

I discovered that here: kubernetes/kubernetes#106697

@skonto skonto changed the title [WIP] If deployment is never available propagate the msg If deployment is never available propagate the msg Feb 2, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 2, 2024
@skonto (Contributor, Author) commented Feb 2, 2024:

/test istio-latest-no-mesh

@skonto (Contributor, Author) commented Feb 5, 2024:

infra
/test istio-latest-no-mesh

@skonto (Contributor, Author) commented Feb 5, 2024:

@dprotaso @ReToCode gentle ping

m := revisionCondSet.Manage(rs)
avCond := m.GetCondition(RevisionConditionResourcesAvailable)

// Skip if set for other reasons
Reviewer (Member):

What other reasons? Why would we need to skip in that case?

@skonto (Contributor, Author) commented Feb 5, 2024:

I don't want to change the current state machine (see comment #14835 (comment)), so I am only targeting a specific case.
If, for example, the deployment faced an issue and the revision's availability condition is already set to false, I am not going to update it and set it to false again; I just skip the update and keep things as is.
In general it should not make a difference: if the availability condition was false as an intermediate state for some other reason (that happens when replicas are not ready), it is going to be reset to true and later we will set it to false again anyway.
I can try without this in a test PR, but I am wondering about side effects in general. A sketch of the skip is shown below.
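
A sketch of that skip, reusing the revisionCondSet accessors shown in the review diff above (the helper name is hypothetical):

    // markUnavailableUnlessAlreadyFalse (hypothetical) flips ResourcesAvailable
    // to False only when it is not False already, so an existing reason and
    // message are never overwritten.
    func markUnavailableUnlessAlreadyFalse(rs *RevisionStatus, reason, message string) {
        m := revisionCondSet.Manage(rs)
        if c := m.GetCondition(RevisionConditionResourcesAvailable); c != nil && c.IsFalse() {
            return // already False for some other reason; keep things as is
        }
        m.MarkFalse(RevisionConditionResourcesAvailable, reason, message)
    }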

Reviewer (Member):

OK, I think the comment was just not explaining this fully.

@@ -137,6 +137,12 @@ func MarkInactive(reason, message string) RevisionOption {
 	}
 }
 
+func MarkActiveUknown(reason, message string) RevisionOption {
Reviewer (Member):

typo (MarkActiveUknown → MarkActiveUnknown)

@@ -92,3 +90,8 @@ func createLatestConfig(t *testing.T, clients *test.Clients, names test.Resource
 		c.Spec = *v1test.ConfigurationSpec(names.Image)
 	})
 }
+
+func hasPullErrorMsg(msg string) bool {
+	return strings.Contains(msg, "Back-off pulling image") ||
Reviewer (Member):

Where do these strings come from? How do we know it's all of them?

@skonto (Contributor, Author):

It is not hard to find them by trial and error; see the previous PR here: #4348.
I know it is not that great, but I don't expect it to be that much of a problem.
Alternatively we can make it independent of the string: I would have to wait for the configuration to fail and then also wait for the revision to have the right status. The reason I have it this way is that if I don't check for the right configuration status with the right message, the test moves on quickly to check the revision, and since there is currently no wait around the revision check it fails immediately (at the revision check).
Note that with this patch we change the initial configuration status to false, so the first configuration check in the test passes quickly. A string-independent variant is sketched below.
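
For illustration, a string-independent wait could poll the Revision's Ready condition instead (sketch only; getRevision and wantRevReason are assumed test helpers/values, and the snippet assumes k8s.io/apimachinery/pkg/util/wait and the serving v1 API are imported):

    // Hypothetical: rather than matching the exact message, wait until the
    // Revision's Ready condition is False with the expected reason.
    err := wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
        r, err := getRevision() // assumed accessor for the revision under test
        if err != nil {
            return false, err
        }
        cond := r.Status.GetCondition(v1.RevisionConditionReady)
        return cond != nil && cond.IsFalse() && cond.Reason == wantRevReason, nil
    })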

Comment on lines 229 to 231:

func (pas *PodAutoscalerStatus) MarkNotReady(reason, mes string) {
	podCondSet.Manage(pas).MarkUnknown(PodAutoscalerConditionReady, reason, mes)
}
Reviewer (Member):

FYI this isn't needed, because marking SKSReady=False will mark the PA Ready=False:

var podCondSet = apis.NewLivingConditionSet(
	PodAutoscalerConditionActive,
	PodAutoscalerConditionScaleTargetInitialized,
	PodAutoscalerConditionSKSReady,
)
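
In other words, a short sketch of the behaviour being described (the reason and message strings here are made up):

    // With a living condition set, marking any dependent condition False also
    // marks the happy (Ready) condition False, so a separate MarkNotReady on
    // PodAutoscalerConditionReady is redundant.
    m := podCondSet.Manage(pas)
    m.MarkFalse(PodAutoscalerConditionSKSReady, "NoHealthyBackends", "SKS is not ready")
    // Now m.GetCondition(PodAutoscalerConditionReady).IsFalse() reports true.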

@skonto (Contributor, Author):

OK, I will check it.

@skonto (Contributor, Author) commented Mar 28, 2024:

This is only used in tests (table_test.go) to show the expected status for the PA, but it can probably be removed.

	logger.Infof("marking resources unavailable with: %s: %s", w.Reason, w.Message)
	rev.Status.MarkResourcesAvailableFalse(w.Reason, w.Message)
} else {
	rev.Status.PropagateDeploymentAvailabilityStatusIfFalse(&deployment.Status)
Reviewer (Member):

The status message on minAvailable isn't very informative. In the reported issue I'm wondering if it's better to surface ErrImagePull/ImagePullBackOff from the Pod.

I believe that's what the original code intended, but hasDeploymentTimedOut is broken because it doesn't take into account a deployment that isn't changing, or scaling from zero. It seems like that's what we should be addressing here.

@skonto (Contributor, Author) commented Feb 21, 2024:

I kept it generic just to show that availability was never reached. Let me check if I can get the container status and how that works.
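
For illustration, surfacing the waiting state from a Pod could look roughly like this (hypothetical helper, not necessarily what the PR ends up doing):

    package example

    import corev1 "k8s.io/api/core/v1"

    // firstWaitingReason (hypothetical) returns the reason and message of the
    // first waiting container in the Pod, e.g. ImagePullBackOff plus its
    // back-off text, so the revision can propagate something actionable.
    func firstWaitingReason(pod *corev1.Pod) (reason, message string, ok bool) {
        for _, cs := range pod.Status.ContainerStatuses {
            if w := cs.State.Waiting; w != nil && w.Reason != "" {
                return w.Reason, w.Message, true
            }
        }
        return "", "", false
    }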

@dprotaso (Member) commented:

hey @skonto are you still working on this?

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2024
@skonto (Contributor, Author) commented Mar 28, 2024:

hey @skonto are you still working on this?

@dprotaso yes, I will take a look and update it based on your comment here: #14835 (comment).

@dprotaso (Member) commented:

/retest

@dprotaso (Member) commented:

Looks like the failures are legit in the api package

@dprotaso (Member) commented Apr 23, 2024:

/hold for release tomorrow (if prow works ;) )

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2024
@skonto (Contributor, Author) commented Apr 26, 2024:

It needs more work.

@skonto skonto changed the title If deployment is never available propagate the container msg [wip] If deployment is never available propagate the container msg Apr 26, 2024
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2024
knative-prow bot commented May 27, 2024:

@skonto: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                    Commit    Required   Rerun command
certmanager-integration-tests_serving_main   3b7524c   true       /test certmanager-integration-tests


@skonto skonto force-pushed the propagate_rev_not_available branch from 3b7524c to 1106210 on June 4, 2024 09:17
@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 4, 2024
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 4, 2024
@skonto skonto changed the title [wip] If deployment is never available propagate the container msg If deployment is never available propagate the container msg Jun 4, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2024
@skonto (Contributor, Author) commented Jun 4, 2024:

@dprotaso please review. This probably needs a release note.

@dprotaso (Member) commented Jun 6, 2024:

Testing this out, it seems like regular cold starts cause the revision/config/ksvc to all become failed (ready=false), and then once the pod is running they flip back to ready=true. Which I don't think is ideal.

I think what we really want is to signal when it has failed to scale, but only after the progress deadline. Maybe that happens as a separate status (warning?) condition; unsure.

@skonto (Contributor, Author) commented Jun 7, 2024:

Which I don't think is ideal.

It changes the behaviour by design; as stated in the description, it starts with false. Anyway, I will check if we can have something less intrusive.

I think what we really want is to signal when it's failed to scale but only after the progressdeadline.

We could probably do that on the scaler side, when activating and before we scale down to zero, by using the pod accessor and inspecting pods directly, I suspect. It should be less intrusive.

@skonto (Contributor, Author) commented Jun 13, 2024:

Closing in favor of #15326.

@skonto skonto closed this Jun 13, 2024
Successfully merging this pull request may close these issues:

Error for failed revision is not reported due to scaling to zero (#14157)