Add support for CephFS volumes / sub-volumes #1023

Open

benaryorg opened this issue Jul 19, 2024 · 19 comments · May be fixed by #1538
Labels
Feature New feature, not a bug

Comments

@benaryorg
Contributor

Required information

  • Distribution: NixOS
  • Distribution version: 24.05
  • The output of "incus info" or if that fails:
    • Kernel version: 6.6.40
    • LXC version: 6.0.1
    • Incus version: 6.2.0
    • Storage backend in use: CephFS

Issue description

CephFS changed its mount string format in Quincy, the release that has recently reached its estimated EoL date (the current release being Reef, with Squid upcoming, AFAIK).
This means that every still-active release (talking about upstream, not distros) uses a mount string that is different from the one Incus is building right now.

This leaves users having a really hard time mounting a CephFS created via the newer CephFS volumes/subvolumes mechanism (at least I haven't gotten it working yet).

As described on the discussion boards, the old syntax was:

[mon1-addr]:3300,[mon2-addr]:3300,[mon3-addr]:3300:/path/to/thing

and a lot of options passed via the -o parameter (or the appropriate field of the mount syscall).
Notably, Incus does not rely on mount.ceph reading the config file for this, but manually scrapes the mon addresses out of the config file itself. That has its own issues: the string matching used is insufficient for configs where an initial mon list refers to the mons by name and the mons are then listed in their own sections with their addresses given directly as mon_addr. With such a config mount.ceph can just mount the volume, but Incus fails during the parsing phase of the config file.
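For illustration, a layout of that kind (host names and addresses here are made-up placeholders, using the documentation prefix) looks roughly like this; mount.ceph copes with it fine, but simple string matching for the mon addresses does not:

[global]
fsid = 00000000-0000-0000-0000-000000000000
mon_initial_members = mon1, mon2, mon3

[mon.mon1]
mon_addr = [2001:db8::1:0]:3300

[mon.mon2]
mon_addr = [2001:db8::1:1]:3300

[mon.mon3]
mon_addr = [2001:db8::1:2]:3300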

The new syntax is:

user@fsid.fs-name=/path/to/thing

So with the user, the (optional) fsid, and the CephFS name encoded into the string there are a few fewer options, although options in general do still exist.
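As a rough illustration only (placeholder monitor addresses, user, fsid, and secret file, not taken from my setup), the same share mounted with both syntaxes would look something like:

# old syntax: monitors in the device string, user and filesystem picked via -o (name=, mds_namespace=/fs=)
mount -t ceph [mon1-addr]:3300,[mon2-addr]:3300,[mon3-addr]:3300:/path/to/thing /mnt -o name=user,secretfile=/etc/ceph/user.secret,mds_namespace=fs-name

# new syntax: user, fsid, and filesystem name encoded in the device string
mount -t ceph user@fsid.fs-name=/path/to/thing /mnt -o secretfile=/etc/ceph/user.secret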

Steps to reproduce

  1. run CephFS on ≥Quincy
  2. create CephFS volume and subvolume
  3. try to mount it

With vaguely correct-seeming parameters provided to Incus this still leads to interesting issues, like getting "No route to host" errors despite everything being reachable.
Honestly, if you find options that manage to mount this, please tell me, because I can't seem to find any.

Information to attach

Any relevant kernel output (dmesg)
[ +13.628392] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[  +0.271853] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[  +0.519922] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[  +0.520979] ceph: No mds server is up or the cluster is laggy
Main daemon log (at /var/log/incus/incusd.log)
Jul 19 20:32:09 lxd2 incusd[10412]: time="2024-07-19T20:32:09Z" level=error msg="Failed mounting storage pool" err="Failed to mount \"[2001:41d0:700:2038::1:0]:3300,[2001:41d0:1004:1a22::1:1]:3300,[2001:41d0:602:2029::1:2]:3300:/\" on \"/var/lib/incus/storage-pools/cephfs\" using \"ceph\": invalid argument" pool=cephfs
  • Container log (incus info NAME --show-log)
  • Container configuration (incus config show NAME --expanded)
  • Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue) (doesn't really log anything about the issue)
@benaryorg
Contributor Author

benaryorg commented Jul 19, 2024

For completeness' sake, here are some commands to get a new CephFS volume and subvolume up and running, and what the final mount command might look like (I'm piecing this together from my shell history, so it's not guaranteed to be 100% accurate):

ceph fs volume create volume-name
ceph fs subvolumegroup create volume-name subvolume-group-name
ceph fs subvolume create volume-name subvolume-name --group_name subvolume-group-name

# this will now spit out a path including the UUID of the subvolume:
ceph fs subvolume getpath volume-name subvolume-name --group_name subvolume-group-name
# then authorize a new client (syntax changes slightly in upcoming version)
ceph fs authorize volume-name client.client-name /volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07 rw
# which can be mounted like (fsid can be omitted if it is in ceph.conf, key will be read from keyring in /etc/ceph too):
mount -t ceph client-name@fsid.volume-name=/volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07 /mnt

@tregubovav-dev

Just a question: what use case is blocked?
I actively use a CephFS storage pool with my Incus + Microceph deployment (as I did with LXD + Microceph in the past) and I do not see any issues. All such volumes mount into the instances.

@stgraber stgraber added this to the later milestone Jul 20, 2024
@stgraber stgraber added the Feature New feature, not a bug label Jul 20, 2024
@stgraber stgraber changed the title from "CephFS mount uses deprecated mount-string, fails to mount Volume based CephFS" to "Add support for CephFS volumes / sub-volumes" Jul 20, 2024
@benaryorg
Contributor Author

benaryorg commented Jul 20, 2024

Just a question: what use case is blocked? I actively use a CephFS storage pool with my Incus + Microceph deployment (as I did with LXD + Microceph in the past) and I do not see any issues. All such volumes mount into the instances.

What does your storage configuration look like?
I've tried several permutations that looked like they could work, but considering that at one point I also had to drop down to incus admin sql to delete a storage pool (which got stuck in pending forever), I did not try everything.

@tregubovav-dev

tregubovav-dev commented Jul 20, 2024

What does your storage configuration look like? I've tried several permutations that looked like they could work, but considering that at one point I also had to drop down to incus admin sql to delete a storage pool (which got stuck in pending forever), I did not try everything.

My cluster configuration is:

        NAME        DRIVER  DESCRIPTION  USED BY   STATE
  remote            ceph                 62       CREATED
  shared_vols       cephfs               11       CREATED
  test_shared_vols  cephfs               1        CREATED
  • test_shared_vols configuration
config:
  cephfs.cluster_name: ceph
  cephfs.path: lxd_test_shared
  cephfs.user.name: admin
description: ""
name: test_shared_vols
driver: cephfs
used_by:
- /1.0/storage-pools/test_shared_vols/volumes/custom/test_vol1?project=test
status: Created
locations:
- cl-06
- cl-07
- cl-01
- cl-02
- cl-03
- cl-04
- cl-05

Steps to create a storage pool and deploy instances that share files using a CephFS volume

  1. You need an existing CephFS volume. In my case it's:
$ sudo ceph fs ls
name: lxd_test_shared, metadata pool: lxd_test_shared_pool_meta, data pools: [lxd_test_shared_pool_data ]
  2. Create the storage pool:
for i in {1..7}; do incus storage create test_shared_vols cephfs source=lxd_test_shared --target cl-0$i; done \
  && incus storage create test_shared_vols cephfs
  3. You can create a storage volume once the pool is created (I use a separate project for the volume and the instances using it):
    incus storage volume create test_shared_vols test_vol1 size=256MiB --project test
  4. Create instances and attach the volume to them:
for i in {1..7}; do inst=test-ct-0$i; \
  echo "Launching instance: $inst"; incus launch images:alpine/edge $inst --project test; \
  echo "Attaching 'test_vol1' to the instance"; incus storage volume attach test_shared_vols test_vol1 $inst data "/data" --project test; \
  echo "Listing content of '/data' directory:"; incus exec $inst --project test -- ls -l /data; \
  done
  5. Put a file on the shared volume:
    incus exec test-ct-04 --project test -- sh -c 'echo -e "This is a file\n placed on the shared volume.\n It is accessible from any instance where this volume is attached.\n" > /data/test.txt'
  6. Check the file's existence and its content on each node:
for i in {1..7}; do inst=test-ct-0$i; echo "Listing content of '/data' directory in the $inst instance"; incus exec $inst --project test -- ls -l /data; done

for i in {1..7}; do inst=test-ct-0$i; echo "--- Printing content of '/data/test.txt' file in the $inst instance ---"; incus exec $inst --project test -- cat /data/test.txt; done

@benaryorg
Contributor Author

So far it does not look like you are using the ceph fs volume feature (at least not with subvolumes), otherwise your CephFS paths would include a UUID somewhere. Besides, using the admin credentials side-steps the mounting issues I'm seeing, because they allow mounting the root of the CephFS even when the intent was to mount a subvolume. If you create a subvolume as per my first reply, you get credentials that do not have access to the root of the CephFS, so as far as I can tell the storage configuration you provided (which does not contain any path and would therefore need access to the root) would fail to mount for lack of permissions.
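For reference, the caps that ceph fs authorize hands out for a subvolume path are scoped roughly like this (paraphrased from memory, so treat the exact form as approximate), i.e. there is no access to the root of the filesystem:

ceph auth get client.client-name
[client.client-name]
	key = <redacted>
	caps mds = "allow rw fsname=volume-name path=/volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07"
	caps mon = "allow r fsname=volume-name"
	caps osd = "allow rw tag cephfs data=volume-name"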

@tregubovav-dev

So far it does not look like you are using the ceph fs volume feature

Yes, you are correct. This is why I asked about your use case.

@benaryorg
Contributor Author

Yes, you are correct. This is why I asked about your use case.

Ah, I see.
The primary advantage to me personally is that I don't have to manually lay out a directory structure (i.e. I do not have to actually mount the CephFS with elevated privileges such as client.admin to administrate it), the quota support is baked in, and authorization of individual clients for shares becomes programmatic over that specific API (i.e. less worrying about adding or removing caps outside the CephFS system).

If I were to automate Incus cluster deployment (or even just deployment for individual consumers of CephFS, handling Incus the same way), I could use the RESTful API module of the MGR for many operations, in a way that is much less error prone than managing CephFS by hand otherwise. I wouldn't need to create individual directory trees, and I would not have to enforce a convention for how the trees are laid out (since volumes come with their own specific layout). Quota management also stops being "write an xattr on a specific directory" and becomes tightly attached to the subvolume. The combination of getpath and the way auth management is handled also makes it a little harder to accidentally use the wrong path. This is mostly about automation and handling things programmatically, which is in line with what OpenStack Manila wants for its backend.
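As a small sketch of that difference (paths and sizes here are made up): with a plain directory you write the quota xattr on a mounted path yourself, whereas with subvolumes the quota is part of the subvolume and managed through the same CLI/API:

# plain CephFS directory: quota is an xattr you set on a mounted path
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/shares/client-a

# subvolume: the quota travels with the subvolume itself
ceph fs subvolume create volume-name subvolume-name --group_name subvolume-group-name --size 10737418240
ceph fs subvolume resize volume-name subvolume-name 21474836480 --group_name subvolume-group-name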

Especially when administering a Ceph cluster in a team with several admins, however, the added constraints make it much easier to work together, since there are no hand-rolled conventions to stick to; Ceph already enforces them.

Being able to create multiple volumes, each of which comes with its own pools and MDSs, also greatly improves how things work when you have to separate tenants for whatever reason.
Given that it's often beneficial to run one big Ceph cluster instead of many small ones (due to the increase in failure domains), I can see how some of the customers I have worked with would like to use that feature (granted, none of those customers were using Incus). For any newer cluster I would absolutely recommend using volumes, if only so that you don't have to go back later and clean up every part where things weren't properly separated (inevitably every user of Ceph needs some level of isolation at some point; I've never not seen it happen).

In short: it keeps me from tripping over my own feet when adding a new isolated filesystem share by taking care of credential management, directory creation, and quotas, all of which I'd surely manage to mess up at least once and, say, delete the client.ceph credentials or something (which wouldn't be possible with the ceph fs deauthorize command as far as I can tell).

TL;DR: it's just more robust as soon as you need to have separate shares for different clients and makes managing the cluster easier if there is a strong separation of concerns.

@tregubovav-dev

Ah, I see.

I appreciated your detailed explanation.

@MadnessASAP
Contributor

I have run into this issue as well, or at least the old-vs-new syntax part of it. My ceph.conf uses DNS names for the monitor addresses, which Incus tries to pass on without resolving:

# incus start ceph-test
Error: Failed to start device "fs-abc": Unable to mount "mon1.example.net:6789,mon2.example.net:6789,mon3.example.net:6789:/abc" at "/var/lib/incus/devices/ceph-test/disk.fs--abc.gitea" with filesystem "ceph": invalid argument

This also generates a kernel error message:

kernel: libceph: Failed to parse monitor IPs: -22

It would seem the sensible solution is to use the new syntax and not attempt to parse the ceph config file.
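For example (the user, fsid, and filesystem name below are placeholders, not taken from my cluster), with the new syntax the monitors don't need to be baked into the device string at all:

# monitor addresses come from ceph.conf and are resolved by the mount.ceph helper
mount -t ceph user@fsid.fs-name=/abc /var/lib/incus/devices/ceph-test/disk.fs--abc.gitea

# or they can be passed explicitly, slash-separated, via the mon_addr option
mount -t ceph user@fsid.fs-name=/abc /mnt -o mon_addr=192.0.2.1:6789/192.0.2.2:6789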

@MadnessASAP
Contributor

I did a bit of poking into the Incus and Ceph code; it seems that the mount.ceph CLI helper does a lot more lifting than I thought it did.

I think I could probably make this better and open a PR. However, it would involve either shelling out to mount.ceph on the host or pulling in go-ceph to do most of the work.

Perhaps @stgraber could provide some insight on a preferred approach?

@stgraber
Member

stgraber commented Dec 5, 2024

Using mount.ceph should be fine.

Last I checked, go-ceph was dynamically linking against libceph, which would then require everyone using Incus to have the Ceph headers available and libceph installed at run time. This doesn't seem ideal. The build-time part we could work with, but requiring libceph at runtime is more of an issue, so unless they move that logic behind dlopen or similar so it only kicks in on systems where Ceph is installed, using the CLI will be easier.

@MadnessASAP
Contributor

To the shell it is then! I didn't realize the nuance regarding go-ceph. I don't suppose there's a mechanism for external storage drivers in the works? That would perhaps offer the best of both worlds.

@stgraber
Member

stgraber commented Dec 5, 2024

Not currently. Storage drivers are pretty much constantly hammered, so running those out of process would basically require them running constantly. We also generally try to avoid plugin mechanisms as much as possible in favor of high quality 1st party integration.

The problem with starting to have plugin APIs, even if only meant for internal components is that folks will almost immediately start (ab)using them for other stuff and get mad when we then break them ;)

@MadnessASAP
Contributor

MadnessASAP commented Dec 7, 2024

#1473

It's not much, but it's a start. That it compiles at all is pretty neat, given that I taught myself just enough Go to draft it.

edit: even better! it worked!!

@stgraber
Member

stgraber commented Dec 7, 2024

Nice!
I just kicked a CI test on that PR so we'll see what our ceph testsuite thinks of it :)

@MadnessASAP
Contributor

I would also like to hear from @benaryorg as I don't believe their use case is covered by mine or CI. This is their "issue" after all.

Would you/they be willing to try the draft PR and see if/where it fails?

@benaryorg
Contributor Author

Given that I'm sorta stuck on LTS and not sure whether I would be affected by irreversible schema changes, I'll have to see if I can spin up a test for this outside my production infrastructure.
It may take a bit until I get around to doing that but it's definitely on my todo list.

If the patch applies cleanly to the version I currently use (6.0.2), I could test a patched build without all that hassle; I just haven't tried that yet (so this is more of a note to myself than anything).

@MadnessASAP
Contributor

Given that the most current iteration seems to have not just broken the Ceph tests but all the tests, I'm gonna say it's probably not going to apply cleanly. Probably shouldn't be anywhere near production infra either.

I'm trying to keep breaking changes to a bare minimum, but one way or another I think something's going to break; 1172 seems applicable to this situation.

@MadnessASAP MadnessASAP linked a pull request Dec 21, 2024 that will close this issue
@MadnessASAP
Contributor

@benaryorg You may appreciate this recent work on my branch and now draft PR:

michael ~> ceph fs ls
name: cephfs, metadata pool: cephfs.meta, data pools: [cephfs.data ]
name: test-fs, metadata pool: cephfs.test-fs.meta, data pools: [cephfs.test-fs.data ]
michael ~> incus storage create ceph2 cephfs source=test-fs/
Storage pool ceph2 created
michael ~> incus storage show ceph2
config:
  cephfs.cluster_name: ceph
  cephfs.path: test-fs/
  cephfs.user.name: admin
  source: test-fs/
description: ""
name: ceph2
driver: cephfs
used_by: []
status: Created
locations:
- none
michael ~> ceph fs subvolume create test-fs test-subvol
michael ~> ceph fs subvolume getpath test-fs test-subvol
/volumes/_nogroup/test-subvol/81e9f4f7-e108-4f8e-8d28-15f0737b1262
michael ~> incus storage create ceph2-subvol cephfs source=test-fs/volumes/_nogroup/test-subvol/81e9f4f7-e108-4f8e-8d28-15f0737b1262
Storage pool ceph2-subvol created

Still some cleanup work to do; however, it is currently testing quite nicely.
