propolis-server can fail to stop/reboot wedged instances #371

@jordanhendricks

One thing I noticed in my latest round of debugging #336 was that I couldn't stop or reboot the guest in question. In the case of #336, the crucible upstairs and downstairs were incompatible, and the failure mode was that only a handful of I/Os were making it through, so the guest wasn't able to do much. While this obviously isn't the happy path, I think we need to be able to stop and reboot instances that get stuck in this way.

It's easy to reproduce some form of #336 by combining a downstairs at 894d44 and an upstairs at e7ce7a. For this case, reboot seems to work, but stop failed, first with a 400, then a 500:

jordan@maxwell ~/propolis $ ./cli.sh state run  
Apr 23 20:26:23.437 INFO PUT request to http://172.20.3.73:8000/instance/state, propolis_client address: 172.20.3.73:8000
Error: failed to set instance state

Caused by:
    Bad Status: 400
jordan@maxwell ~/propolis $ 
jordan@maxwell ~/propolis $ ./cli.sh state run
Apr 23 20:26:57.455 INFO PUT request to http://172.20.3.73:8000/instance/state, propolis_client address: 172.20.3.73:8000
Error: failed to set instance state

Caused by:
    Bad Status: 500

server logs:

Apr 23 20:26:23.438 INFO Requested state Run via API, component: vm_controller                                                                                                                                     
Apr 23 20:26:23.438 INFO Queuing external request, disposition: Deny(HaltPending), request: Start, component: external_request_queue
Apr 23 20:26:23.439 INFO request completed, error_message_external: Instance operation failed: Failed to queue requested state change: Instance is preparing to stop, error_message_internal: Instance operation failed: Failed to queue requested state change: Instance is preparing to stop, response_code: 400, uri: /instance/state, method: PUT, req_id: da972575-1e38-46c4-9b06-88fff4c227de, remote_addr: 172.20.3.73:54058, local_addr: 172.20.3.73:8000

Apr 23 20:26:57.455 INFO accepted connection, remote_addr: 172.20.3.73:61958, local_addr: 172.20.3.73:8000
Apr 23 20:26:57.456 INFO request completed, error_message_external: Internal Server Error, error_message_internal: Server not initialized (no instance), response_code: 500, uri: /instance/state, method: PUT, req_id: 40a97378-4272-4907-a44b-2320c0c7c1b2, remote_addr: 172.20.3.73:61958, local_addr: 172.20.3.73:8000

I can't recall if this is the exact failure mode I saw when debugging #336, but this particular instance of it feels in the same realm as #363 (maybe?).

Labels: api (Related to the API.), bug (Something that isn't working.)
