
[FIX] If cleanup should be able to mark build as failed, it needs to be run before finalizers #29

Open
Jimilian wants to merge 2 commits into master

Conversation

@Jimilian commented Aug 19, 2016

So, according to the docs:

    /**
     * Return true if this {@link Publisher} needs to run after the build result is
     * fully finalized.
     *
     * <p>
     * The execution of normal {@link Publisher}s are considered within a part
     * of the build. This allows publishers to mark the build as a failure, or
     * to include their execution time in the total build time.
     *
     * <p>
     * So normally, that is the preferrable behavior, but in a few cases
     * this is problematic. One of such cases is when a publisher needs to
     * trigger other builds, which in turn need to see this build as a
     * completed build. Those plugins that need to do this can return true
     * from this method, so that the {@link #perform(AbstractBuild, Launcher, BuildListener)} 
     * method is called after the build is marked as completed.
     *
     * <p>
     * When {@link Publisher} behaves this way, note that they can no longer
     * change the build status anymore.
     *
     * @since 1.153
     */
    public boolean needsToRunAfterFinalized() {
        return false;
    }

So, if needsToRunAfterFinalized returns true, the Publisher is no longer able to change the build result (the code below is from the latest Jenkins, but it's quite old) and any negative result is just logged:

protected final boolean performAllBuildSteps(BuildListener listener, Iterable<? extends BuildStep> buildSteps, boolean phase) throws InterruptedException, IOException {
    boolean r = true;
    for (BuildStep bs : buildSteps) {
        if ((bs instanceof Publisher && ((Publisher)bs).needsToRunAfterFinalized()) ^ phase)
            try {
                if (!perform(bs,listener)) {
                    LOGGER.log(Level.FINE, "{0} : {1} failed", new Object[] {AbstractBuild.this, bs});
                    r = false;
                    if (phase) { // Phase would be "false" if needsToRunAfterFinalized returns "true"
                        setResult(Result.FAILURE);
                    }
                }
    ...
}
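
To make the mechanism concrete, here is a minimal sketch of a publisher that can still fail the build because it stays in the in-build phase. The class and helper names are hypothetical, not the actual ws-cleanup code:

import hudson.Launcher;
import hudson.model.AbstractBuild;
import hudson.model.BuildListener;
import hudson.model.Result;
import hudson.tasks.BuildStepMonitor;
import hudson.tasks.Notifier;

// Hypothetical publisher, illustrating the mechanism only.
public class FailingCleanupPublisher extends Notifier {

    @Override
    public boolean needsToRunAfterFinalized() {
        // Returning false keeps this publisher in the "phase == true" pass of
        // performAllBuildSteps() above, where the build result can still be changed.
        return false;
    }

    @Override
    public boolean perform(AbstractBuild<?, ?> build, Launcher launcher, BuildListener listener) {
        boolean cleanedUp = doCleanup(build, listener);   // hypothetical helper
        if (!cleanedUp) {
            // Returning false in this phase already makes the core call
            // setResult(Result.FAILURE) (see the snippet above); the explicit
            // call just documents the intent.
            build.setResult(Result.FAILURE);
        }
        return cleanedUp;
    }

    private boolean doCleanup(AbstractBuild<?, ?> build, BuildListener listener) {
        listener.getLogger().println("cleaning workspace (placeholder)");
        return true;   // placeholder for the real cleanup logic
    }

    @Override
    public BuildStepMonitor getRequiredMonitorService() {
        return BuildStepMonitor.NONE;
    }
}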

Unfortunately, I failed to write a test for this fix, so any suggestion on how to do that would be nice. I did test it manually and it works as expected: it marks the build as failed when needed.

@Jimilian (Author)

@olivergondza, @oleg-nenashev what do you think about it?

@olivergondza (Member)

I agree this is what needs to be done in order to implement the 'fail the build' functionality. But it can cause other publishers to misbehave, as the workspace cleanup can then run before them and purge the data they want. What makes this even more surprising is that (IIRC) the order of publisher execution is not the order of their declaration.

To me it seems more important to perform the cleanup as the very last thing than to implement the 'fail the build' functionality. I am actually starting to think the method is implemented the way it is for a reason.

@Jimilian (Author)

@olivergondza In our environment it works as expected, so it looks like the order of execution is the same as the order of declaration. Maybe somebody from CloudBees can help us understand the guarantees. From the code's perspective you have a list of publishers, but the method that performs them accepts any Iterable, so... some clarification is needed.

Without my PR the 'fail the build' functionality doesn't work at all, so if you don't want to merge this PR it's better to remove that option from the plugin, because it misleads users.

@oleg-nenashev (Member)

What makes this even more surprising is that (IIRC) the order of publisher execution is not the order of their declaration.

It should not be the case for any more or less modern Jenkins core version. On very old Jenkins cores (1.400?) there was a checkbox-based selection of post-build actions, so there was no order of definitions. If a job has been migrated from such an old core version and never saved afterwards, then the Publisher execution order may differ from the order in the UI. But it's not a realistic case, IMHO.

@oleg-nenashev (Member) commented Aug 23, 2016

Actually, the build status change behavior changed in Jenkins 1.584 (jenkinsci/jenkins@28dfd90), so this plugin likely behaves incorrectly since then. We already had such issues in other plugins (I do not recall which ones).

So I agree that a fix is required in the WS Cleanup plugin. The proposed one is not ideal, since the plugin behavior may differ depending on the advanced setting. Maybe switching to false is not so bad here.

@Jimilian (Author)

The plugin behaviour would differ depending on the advanced setting anyway ;)
But yes, I agree that false could be the better solution if we are talking about user expectations (order matters with any plugin configuration).

@olivergondza (Member)

@oleg-nenashev, @Jimilian, I gave it a try and publishers are not executed in the order they are declared. There is some Notifier/Recorder magic combined with needsToRunAfterFinalized() involved. Check this out:

From my observation, the implementations where needsToRunAfterFinalized == false are always executed before those that return true.

So if there is a publisher that relies on the workspace and declares needsToRunAfterFinalized == true, it will blow up even when it is declared earlier than the workspace cleanup (if the cleanup starts returning false here). Unfortunately, whether a post-build step is a Notifier or a Recorder, and what its needsToRunAfterFinalized returns, is not presented to the user, so the user will be clueless.
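
A simplified model of that ordering (illustrative only, not the literal core source; "mailer" and "ws-cleanup" are stand-ins): the core effectively makes two passes over the declared steps, so the phase, not the declared order, decides who runs first:

import java.util.Arrays;
import java.util.List;

// Models the two passes implied by the performAllBuildSteps() snippet above.
// "Step" stands in for Publisher; nothing here is real Jenkins code.
public class PhaseOrderDemo {
    interface Step { String name(); boolean needsToRunAfterFinalized(); }

    static Step step(String name, boolean afterFinalized) {
        return new Step() {
            public String name() { return name; }
            public boolean needsToRunAfterFinalized() { return afterFinalized; }
        };
    }

    public static void main(String[] args) {
        // Declared order: notifier first, cleanup second.
        List<Step> declared = Arrays.asList(step("mailer", true), step("ws-cleanup", false));

        for (boolean phase : new boolean[] {true, false}) {        // true = within the build
            for (Step s : declared) {
                if (s.needsToRunAfterFinalized() ^ phase) {        // same XOR as the core snippet
                    System.out.println((phase ? "in-build:  " : "finalized: ") + s.name());
                }
            }
        }
        // Output: ws-cleanup runs in the in-build pass, mailer only after finalization,
        // regardless of the declared order.
    }
}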

@Jimilian (Author)

@olivergondza yes, you are right: if a user has two plugins, A (needsToRunAfterFinalized == false) and B (needsToRunAfterFinalized == true), A will be executed first in any case.

But to me it's clear that nobody can rely on data from the workspace after finalizers run. Those should only be pure notify-only actions (sending an e-mail, a message to IM, etc.).

@olivergondza (Member) commented Aug 23, 2016

Note that there is no link between the build result being computed and the workspace being available aside from this PR, so I do not think you can assume there are no notification plugins that rely on the workspace. Quick grepping through the sources reveals there are such plugins: https://github.com/jenkinsci/ivy-plugin/blob/master/src/main/java/hudson/ivy/IvyBuildTrigger.java or https://github.com/jenkinsci/email-ext-plugin/blob/master/src/main/java/hudson/plugins/emailext/ExtendedEmailPublisher.java

@Jimilian (Author) commented Aug 23, 2016

@olivergondza, if the cleanup fails I want to see the build fail, because in this case I will catch the failure as soon as possible. Otherwise, I need to investigate the root cause: analyse node history, etc. Last time we spent 20 hours finding the root cause, only because the developer who introduced an invalid character got a success result.

Note that there is no link between the build result being computed and the workspace being available aside from this PR

It's true. I hope that CloudBees can fill this gap. Otherwise I can't see any reason to split publishers into two groups.

Of course, it's your plugin and your rules, but the guy who developed IvyBuildTrigger made a premature optimisation and introduced a bug. I see your point that a lot of users are using this plugin and only I am interested in the 'fail fast' way of working, but then we should be consistent and remove the 'fail if cleanup fails' option, and notify all users that the plugin doesn't provide any feedback anymore. I think it's much more important to provide clear feedback than to support invalid pipelines.

@olivergondza (Member)

if the cleanup fails I want to see the build fail, because in this case I will catch the failure as soon as possible.

Definitely.

It's true. I hope that CloudBees can fill this gap. Otherwise I can't see any reason to split publishers into two groups.

The reason is that the latter group can rely on the build status being final. It has nothing to do with the workspace.

... but the guy who developed IvyBuildTrigger made a premature optimisation and introduced a bug.

As I said, it is a sane and not uncommon assumption, and it is our job, as ws-cleanup contributors, to deal with that.

Let me dig into this a little bit to see what can be done to preserve the feedback to the users.

@olivergondza self-assigned this Aug 24, 2016
@olivergondza (Member)

One more problem with running this before the build is finalized: the post-build step decides whether to clean or not based on a build status that can still change later. Additionally, when the workspace is cleaned asynchronously (which is what we want to be turned on by default) there is no way to inform people that something is wrong.
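
For illustration (hypothetical code, assuming the decision looks at the current result): a clean-or-not check that runs before finalization can act on a status that later changes:

import hudson.model.AbstractBuild;
import hudson.model.Result;

// Hypothetical helper illustrating the race, not actual plugin code.
final class CleanupDecision {
    static boolean shouldClean(AbstractBuild<?, ?> build) {
        Result current = build.getResult();   // before finalization this may be null or not yet final
        // Any publisher still to run in this phase (or the finalization itself) can
        // downgrade the result after this check, so the decision is potentially stale.
        return current == null || current.isBetterOrEqualTo(Result.UNSTABLE);
    }
}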

I am leaning towards representing the failure to clean up in some other way afterwards. Failing the next build, perhaps?

@olivergondza (Member)

One more thought: once the post-build step throws an exception, we must expect the WS to be in an inconsistent state from the user's perspective (it was not wiped out or cleaned selectively as requested) and it should not be reused by future builds.

An alternative approach would be to wipe out (rename and delete asynchronously) the workspace if the configured cleanup fails. That should eliminate most configuration problems (a custom command that does not work, for instance) but it can still fail because of permissions, etc. Those problems will be reported to the admin via #28 (there are dangling directories occupying space somewhere) but future builds will not be affected.
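
A minimal, plugin-agnostic sketch of that "rename now, delete asynchronously" idea, using plain java.nio only for brevity (the real plugin would go through Jenkins' FilePath/remoting, and the directory naming here is made up):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

// Sketch of "wipe out by rename, delete in the background".
public final class DeferredWipeout {
    private static final Executor DELETER = Executors.newSingleThreadExecutor();

    /** Renames the workspace out of the way immediately, deletes it asynchronously. */
    public static void wipeOut(Path workspace) throws IOException {
        Path doomed = workspace.resolveSibling(workspace.getFileName() + "_ws-cleanup_" + System.nanoTime());
        Files.move(workspace, doomed);                 // cheap: the next build gets a fresh directory
        DELETER.execute(() -> {
            try (Stream<Path> paths = Files.walk(doomed)) {
                paths.sorted(Comparator.reverseOrder()) // delete children before parents
                     .forEach(p -> { try { Files.delete(p); } catch (IOException ignored) { } });
            } catch (IOException e) {
                // A failed deletion leaves a dangling "_ws-cleanup_" directory behind;
                // reporting that to the admin is what #28 is about.
            }
        });
    }
}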

@Jimilian, WDYT?

@Jimilian (Author) commented Aug 24, 2016

I think the user should be able to make the choice (that's why in my first commit I re-used the 'fail if cleanup failed' option), because in our environment the priorities are quite simple:

  1. Stability
  2. Consistency (nobody should receive negative feedback that is not related to the commit itself)
  3. Speed

Also, different users prefer different plugins and it's hard to predict all ways of working.

If you agree, I can introduce a global option (in the Jenkins config) that changes the value of needsToRunAfterFinalized. I can also do the same for the async cleanup, because valid feedback is more important for us when it's related to the current commit.
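
A sketch of what such a global toggle could look like (all names are hypothetical, the global.jelly form for the checkbox is omitted, and this is not the actual plugin code):

import hudson.Extension;
import hudson.Launcher;
import hudson.model.AbstractBuild;
import hudson.model.AbstractProject;
import hudson.model.BuildListener;
import hudson.tasks.BuildStepDescriptor;
import hudson.tasks.BuildStepMonitor;
import hudson.tasks.Notifier;
import hudson.tasks.Publisher;
import net.sf.json.JSONObject;
import org.kohsuke.stapler.StaplerRequest;

// Hypothetical publisher with a Jenkins-wide toggle deciding the phase it runs in.
public class ConfigurableCleanupPublisher extends Notifier {

    @Override
    public boolean needsToRunAfterFinalized() {
        // The admin decides globally whether the cleanup runs before or after finalization.
        return !((DescriptorImpl) getDescriptor()).isRunBeforeFinalization();
    }

    @Override
    public boolean perform(AbstractBuild<?, ?> build, Launcher launcher, BuildListener listener) {
        listener.getLogger().println("cleaning workspace (sketch)");
        return true;
    }

    @Override
    public BuildStepMonitor getRequiredMonitorService() {
        return BuildStepMonitor.NONE;
    }

    @Extension
    public static final class DescriptorImpl extends BuildStepDescriptor<Publisher> {
        private boolean runBeforeFinalization;   // persisted global flag

        public DescriptorImpl() { load(); }

        public boolean isRunBeforeFinalization() { return runBeforeFinalization; }

        @Override
        public boolean configure(StaplerRequest req, JSONObject json) throws FormException {
            runBeforeFinalization = json.optBoolean("runBeforeFinalization");
            save();
            return true;
        }

        @Override
        public boolean isApplicable(Class<? extends AbstractProject> jobType) { return true; }

        @Override
        public String getDisplayName() { return "Cleanup publisher (sketch)"; }
    }
}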

You are right that after such issues the workspace is in an inconsistent state and it doesn't make sense to keep it. So, the user can do several things even now:

  • To recover from this state the user can use "cleanup before build starts" (btw, I'm going to add some timeout to this action in the future); this could work asynchronously as well. I'm not sure it would be possible to change the build status on failure in this case, and I'm also not sure it should be implemented, because it's not the user's fault that the plugin failed to clean up the workspace after the previous build.
  • Detach the broken node: we have a script that monitors logs and detaches nodes with a broken workspace. We use this approach because we failed to find a way to automatically recover nodes in all cases.

I think asynchronous cleanup before the build starts is better in this case, because the show must go on. (And whoever wants to can detach a node if the disk has less than 90% of its space.)
In the past we had several accidents with asynchronous cleaning: the disk became full too fast on some nodes. It happened because the plugin tried to clean up asynchronously. Moved the workspace. Failed to clean up. Didn't raise any visible feedback. Repeated this several times while the disk still had some space. So, an email to the admin would be nice as well.

Extremely important: I don't expect any hidden magic on the plugin side, only some general recommendation from the contributors. Otherwise, it could create new (unexpected) bugs/performance issues.

@olivergondza (Member)

In general there is no way to tell whether a failure to delete the workspace is a problem of the infrastructure, the job configuration, or the code that is built.

Moved the workspace. Failed to clean up. Didn't raise any visible feedback.

Right, that should be fixed in #28.

You are right that after such issues the workspace is in an inconsistent state and it doesn't make sense to keep it. So, the user can do several things even now:

The problem is that the user needs to think ahead. I would love to be more helpful than that.
