Policy does not trigger automatic software install if host was removed from software scope then re-added (using labels) #25071

Open
jmwatts opened this issue Dec 31, 2024 · 3 comments
Labels: bug, #g-software, :release, ~unreleased bug


jmwatts commented Dec 31, 2024

Fleet version: v4.62.0

Web browser and operating system: Chrome 131.0.6778.205 on macOS


💥  Actual behavior

The policy runs and fails, but the software install is not triggered: no pending install for the software title is queued in the host's Upcoming activity.

🧑‍💻  Steps to reproduce

  1. Add a label to an existing macOS host (ex: Testing scope)
  2. Go to Software >> Add software
  3. Choose either Fleet-maintained or Custom package
  4. Choose "Automatic" for "Install" method
  5. Choose "Custom" for "Target" and select "Include any"
  6. Select label from step 1
  7. Click Add software
  8. On the host detail page, view "Upcoming" activity for the host after clicking "Refetch"
  9. Once the vitals have been refetched and you see a pending install for your software title queued in the Upcoming activity section, QUICKLY change the scope for the software title to "Exclude any" and the label from step 1 (need to do this before the installation starts on the host device)
  10. Confirm the pending install action has been removed from the Upcoming activity and the software is not showing as a completed install in "Past" activity
  11. Change the scope for the software title back to "Include any" and the label from step 1, save
  12. Refetch vitals so the policy runs again. You can also trigger the policy from Policies >> Run.

🕯️ Expected results

Now that the host is back within the scope of the software target via label, the failed policy should trigger a software install on the host.
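To make the scope changes in steps 5, 9, and 11 concrete, here is a minimal sketch of how label targeting could be evaluated. It assumes "Include any" means a host is in scope when it has at least one of the selected labels, and "Exclude any" means a host is in scope only when it has none of them. The names `scopeMode`, `hasAny`, and `inScope` are hypothetical and are not taken from the Fleet codebase.

```go
package main

import "fmt"

// scopeMode is a hypothetical stand-in for the "Target" setting on a software title.
type scopeMode int

const (
	includeAny scopeMode = iota // in scope if the host has any selected label
	excludeAny                  // in scope only if the host has none of the selected labels
)

// hasAny reports whether the host carries at least one of the scope labels.
func hasAny(hostLabels map[string]bool, scopeLabels []string) bool {
	for _, l := range scopeLabels {
		if hostLabels[l] {
			return true
		}
	}
	return false
}

// inScope applies the assumed include/exclude semantics described above.
func inScope(mode scopeMode, hostLabels map[string]bool, scopeLabels []string) bool {
	matched := hasAny(hostLabels, scopeLabels)
	if mode == excludeAny {
		return !matched
	}
	return matched
}

func main() {
	host := map[string]bool{"Testing scope": true}
	labels := []string{"Testing scope"}

	fmt.Println("step 5-6 (Include any):", inScope(includeAny, host, labels))
	fmt.Println("step 9 (Exclude any):", inScope(excludeAny, host, labels))
	fmt.Println("step 11 (Include any):", inScope(includeAny, host, labels))
}
```

Under these assumptions the host leaves scope at step 9 and re-enters it at step 11, which is the transition the expected results depend on.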

@jmwatts added the bug, :release, #g-mdm, ~unreleased bug, and :incoming labels on Dec 31, 2024
@jmwatts added this to the 4.62.0-tentative milestone on Dec 31, 2024

jmwatts commented Jan 2, 2025

Note:
While testing other auto-install workflows, I discovered that if you remove the software after it's installed, but before the policy has run again and passed, Fleet will never re-issue an install command. That workflow looks like this:

  1. Install software triggered by policy
  2. Software is installed on host
  3. Policy is still marked as failed from previous refetch
  4. Do NOT refetch
  5. Delete software from host (as end user)
  6. Refetch

When you do those steps, the software won't be reinstalled. I think this may be similar because the policy runs once and fails, identifies the software is not installed, and issues the install command. Because we cancel that command as soon as we change the scope, when the policy fails again, it doesn't re-issue the install command.

Our doc "Automatic software install in Fleet" says:
"Fleet will send install requests to the hosts on the first policy failure (first "No" result for the host) or if a policy goes from "Yes" to "No". On this iteration it will not send an install request if a policy is already failing and continues to fail ("No" -> "No"). See the following flowchart for details."
I don't see a ticket for another iteration to make this work for automatic software installs. Is there one somewhere? @noahtalerman
I feel there should definitely be one, because this is a state where software should be installed on a host but won't be, depending on timing and, in this case, a change to scope.
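The rule quoted from the doc can be sketched as a small transition check. This is only an illustration of the documented behavior, not Fleet's implementation; `PolicyResult` and `shouldTriggerInstall` are hypothetical names.

```go
package main

import "fmt"

// PolicyResult is a hypothetical tri-state for a host's last known policy response.
type PolicyResult int

const (
	NoResult PolicyResult = iota // host has not reported a result for this policy
	Pass                         // "Yes"
	Fail                         // "No"
)

// shouldTriggerInstall reports whether moving from prev to curr would send an
// install request under the quoted rule: the first "No", or "Yes" -> "No".
// A policy that keeps failing ("No" -> "No") does not send another request.
func shouldTriggerInstall(prev, curr PolicyResult) bool {
	if curr != Fail {
		return false
	}
	return prev != Fail
}

func main() {
	fmt.Println("first failure:", shouldTriggerInstall(NoResult, Fail))
	fmt.Println("Yes -> No:", shouldTriggerInstall(Pass, Fail))
	fmt.Println("No -> No:", shouldTriggerInstall(Fail, Fail))
}
```

In both workflows described above, the host's previous result is already "No" when the policy runs again, so a check of this shape would return false and no install request would be sent.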

@mostlikelee added the #g-software label and removed the #g-mdm and :incoming labels on Jan 3, 2025

iansltx commented Jan 3, 2025

So, single-attempt installs were what was originally spec'd, and I don't think we want to change that behavior. However, as future work (out of scope for this ticket) we could, and likely should, give orbit enough information to know which policy an install was for. Orbit could then rerun that policy's query and report the result right after a successful install. We can't just use install success as a proxy for a policy pass, because the query may be looking for multiple things. The same probably applies to policy-initiated script runs, though that can be separate work, likely at lower priority. The idea is to remove the one-hour window in which someone could uninstall successfully installed, policy-initiated software and then never get another install attempt until the installer or policy is modified.

With that out of the way, the current issue is that bringing hosts into scope for an automation via labels doesn't reset the policy status for those hosts, and it should. "We brought the host in scope for a policy-automated install by changing labels for the installer" is comparable to "we brought the host in scope for a policy-automated install by adding an install automation to the policy". The only difference is that it applies to hosts in that team that are in (or outside) that label, rather than to every host in that team.

We don't need to clear stats when a host goes out of scope via labels, as no action is required there. We do when the host comes back in scope.

ProcessInstallerSideEffects and cleanupPolicy are the places where this logic currently exists for installer changes that affect the whole team, so that would be the starting point here. Not sure whether we have fine-grained enough data structures to handle this operation correctly only for hosts affected by label changes though.


iansltx commented Jan 3, 2025

Per earlier discussions, on installer label scope changes we need to:

  1. Calculate which hosts are newly covered by the revised scope (we want to diff so we aren't clearing policy statuses unnecessarily)
  2. Check which policies include those hosts AND have the installer as an automation
  3. Clear policy stats for (2)
  4. Clear host policy status for every pairing of a host from (1) with a policy from (2); for consistency with the behavior when adding or modifying an installer automation overall, we should clear both successful and failed policy statuses

The stats/aggregations cron will repopulate policy stats, and host policy check-ins will repopulate status information. That gives hosts the opportunity to go from blank to failed, which triggers the automation again and resolves this bug.
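The four steps above can be sketched with in-memory data. This is a rough illustration under stated assumptions, not the change to ProcessInstallerSideEffects or cleanupPolicy: the types `policy` and `hostPolicy` and the helpers `newlyCovered`, `policiesForInstaller`, and `clearStatuses` are all hypothetical. It also assumes every host involved is on the installer's team, so the "policies include those hosts" check reduces to matching the installer, and it leaves the aggregate policy stats from step 3 to the cron.

```go
package main

import (
	"fmt"
	"sort"
)

// policy is a hypothetical record linking a policy to its install automation.
type policy struct {
	ID          uint
	InstallerID uint
}

// hostPolicy is a hypothetical key for one host's stored result on one policy.
type hostPolicy struct {
	HostID   uint
	PolicyID uint
}

// newlyCovered implements step 1: hosts in scope after the change but not before.
func newlyCovered(before, after map[uint]bool) []uint {
	var out []uint
	for id, in := range after {
		if in && !before[id] {
			out = append(out, id)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// policiesForInstaller implements a simplified step 2: policies whose
// automation is the given installer (team membership is assumed, see lead-in).
func policiesForInstaller(all []policy, installerID uint) []uint {
	var out []uint
	for _, p := range all {
		if p.InstallerID == installerID {
			out = append(out, p.ID)
		}
	}
	return out
}

// clearStatuses implements step 4: remove stored results, passing or failing,
// for every host/policy pairing. It returns how many entries were removed.
func clearStatuses(statuses map[hostPolicy]bool, hosts, policies []uint) int {
	cleared := 0
	for _, h := range hosts {
		for _, p := range policies {
			k := hostPolicy{HostID: h, PolicyID: p}
			if _, ok := statuses[k]; ok {
				delete(statuses, k)
				cleared++
			}
		}
	}
	return cleared
}

func main() {
	before := map[uint]bool{1: true, 2: true} // scope before the label change
	after := map[uint]bool{2: true, 3: true}  // scope after the label change
	policies := []policy{{ID: 10, InstallerID: 7}, {ID: 11, InstallerID: 8}}

	// Stored results: false means the host last reported "No".
	statuses := map[hostPolicy]bool{
		{HostID: 2, PolicyID: 10}: false,
		{HostID: 3, PolicyID: 10}: false,
		{HostID: 3, PolicyID: 11}: false,
	}

	hosts := newlyCovered(before, after)
	pols := policiesForInstaller(policies, 7)
	n := clearStatuses(statuses, hosts, pols)
	fmt.Println("newly covered hosts:", hosts)
	fmt.Println("policies with this installer:", pols)
	fmt.Println("statuses cleared:", n, "remaining:", len(statuses))
}
```

Diffing first keeps the result for host 2, which was already in scope, and clearing only policy 10 leaves host 3's result on the unrelated policy 11 untouched. With its status blank, host 3's next "No" counts as a first failure.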
