
ALEPH-615 Expose resources for executions #833

Open
olethanh wants to merge 1191 commits into main from
ol-aleph-615-executions-resources

Conversation

@olethanh
Contributor

@olethanh olethanh commented Sep 16, 2025

Add a new `resources` field on the existing endpoint `/v2/about/executions/resources` to expose the resources allocated to instances.

Related ClickUp, GitHub or Jira tickets : ALEPH-615

Opening this PR so we can discuss what info we need and the format.

Example of what this endpoint will return; it is basically the same endpoint response with an added `resources` field.

{
  "decadecadecadecadecadecadecadecadecadecadecadecadecadecadecadeca": {
    "networking": {
      "ipv4_network": "172.16.4.0/24",
      "host_ipv4": "64.227.122.196",
      "ipv6_network": "2a01:e0a:8b1:95e1:3:deca:deca:dec0/124",
      "ipv6_ip": "2a01:e0a:8b1:95e1:3:deca:deca:dec1",
      "ipv4_ip": "172.16.4.2",
      "mapped_ports": {
        "22": {
          "host": 24001,
          "tcp": true,
          "udp": false
        }
      }
    },
    "resources": {
      "vcpus": 1,
      "memory": 512,
      "disk_mib": 2048
    },
    "status": {
      "defined_at": "2025-09-16 13:35:18.983564+00:00",
      "preparing_at": "2025-09-16 13:35:19.012594+00:00",
      "prepared_at": "2025-07-29 11:11:35.853427",
      "starting_at": "2025-07-29 11:11:35.891511",
      "started_at": "2025-09-16 13:35:19.075488+00:00",
      "stopping_at": null,
      "stopped_at": null
    }
  }
}
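To illustrate how a client might consume this format, here is a minimal sketch that parses the example payload above and summarizes the `resources` field per execution. The `summarize_resources` helper is hypothetical, not part of aleph-vm:

```python
import json

# Sample payload mirroring the example above; the top-level key is the
# item hash of the execution.
sample = json.loads("""
{
  "decadecadecadecadecadecadecadecadecadecadecadecadecadecadecadeca": {
    "resources": {"vcpus": 1, "memory": 512, "disk_mib": 2048}
  }
}
""")

def summarize_resources(executions: dict) -> list[str]:
    """Return one human-readable line per execution in the response."""
    lines = []
    for item_hash, info in executions.items():
        res = info.get("resources", {})
        lines.append(
            f"{item_hash[:12]} vcpus={res.get('vcpus')} "
            f"memory={res.get('memory')}MiB disk={res.get('disk_mib')}MiB"
        )
    return lines

print(summarize_resources(sample)[0])
```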

gdelfino and others added 30 commits February 17, 2025 14:56
Update README.md

Minor typo fixed
* Garbage collector: free disk space from inactive VMs

Add a script to manually list and remove volumes linked to inactive VMs.
It fetches data from the scheduler and the pyaleph main node to gather information on the status of each VM,
then displays it to the user so they can determine whether the volumes can be removed safely.

JIRA ticket ALEPH-37

* Add diagnostic vm to ignore list
* Feature: Added options to get GPU compatibilities from a settings aggregate.

* Fix: Refactored to also return the model name from the aggregate and use the same device_id format.

* Fix: Include GPU list and move the VM egress IPv6 check on the connectivity check to start notifying the users about the next requirement.

* Fix: Solved code quality issues.

* Fix: Put definitive settings aggregate address

* Fix: Solved issue with type casting and moved the aggregate check.

* Check community payment flow (#751)

* Implement community payment check WIP

* isort

* Check community flow at allocation

* Community flow  : fix after testing

* mod Use singleton for the Setting Aggregate

* fix test

* Implement community wallet start time

---------

Co-authored-by: Olivier Le Thanh Duong <olivier@lethanh.be>
Feature: Upgrade to new `aleph-message` version
The ipv6 egress check was showing OK even while the test was failing and returning result: False
Feature: Create Ubuntu 24.04 QEMU runtime
CRNs with a bad IPv6 config were getting bad metrics because /status/check/fastapi was too slow. Looking further into it, the slowness came from /vm/63faf8b5db1cf8d965e6a464a0cb8062af8e7df131729e48738342d956f29ace/internet always taking 5 seconds, even though it ultimately returned a positive result.

Coincidentally 5 seconds is the timeout configured in check_url inside the diagnostic vm (example_fastapi/main.py)

To reproduce the issue, set the DNS servers via an environment variable, with the IPv6 server being unreachable:
```env
ALEPH_VM_DNS_NAMESERVERS = '[2001:41d0:3:163::1, 9.9.9.9]'
```
and set the IP pool to the default one:
```
ALEPH_VM_IPV6_ADDRESS_POOL = fc00:1:2:3::/64
```

Solution:

Further analysis showed that the diagnostic VM tried to connect to the IPv6 DNS server, timed out after 5 seconds, then proceeded normally with the IPv4 DNS.

Put the IPv4 DNS server BEFORE the IPv6 one.
Also disable the IPv6 server if IPv6 is not enabled for the VM. That modification had been done for instances but had not been reproduced for programs.
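The ordering rule described above can be sketched as follows. This is an illustrative helper, not the actual aleph-vm implementation; the function name and signature are assumptions:

```python
import ipaddress

def order_nameservers(servers: list[str], ipv6_enabled: bool) -> list[str]:
    """Place IPv4 DNS servers before IPv6 ones, and drop IPv6 servers
    entirely when the VM has no IPv6 connectivity, so an unreachable
    IPv6 resolver can no longer add a 5-second timeout to each lookup."""
    v4 = [s for s in servers if ipaddress.ip_address(s).version == 4]
    v6 = [s for s in servers if ipaddress.ip_address(s).version == 6]
    return v4 + v6 if ipv6_enabled else v4

# With the settings from the reproduction above, the IPv4 server now comes first
print(order_nameservers(["2001:41d0:3:163::1", "9.9.9.9"], ipv6_enabled=True))
```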
* Fix start error in load_persistent_executions

Fix a parse error that prevented the Supervisor from starting when loading
persistent executions, as it could not parse the gpu field:

    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/opt/aleph-vm/aleph/vm/orchestrator/__main__.py", line 4, in <module>
        main()
      File "/opt/aleph-vm/aleph/vm/orchestrator/cli.py", line 379, in main
        supervisor.run()
      File "/opt/aleph-vm/aleph/vm/orchestrator/supervisor.py", line 184, in run
        asyncio.run(pool.load_persistent_executions())
      File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/opt/aleph-vm/aleph/vm/pool.py", line 252, in load_persistent_executions
        execution.gpus = parse_raw_as(List[HostGPU], saved_execution.gpus)
      File "pydantic/tools.py", line 74, in pydantic.tools.parse_raw_as
        obj = load_str_bytes(
      File "pydantic/parse.py", line 37, in pydantic.parse.load_str_bytes
        return json_loads(b)
      File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
        raise TypeError(f'the JSON object must be str, bytes or bytearray, '
    TypeError: the JSON object must be str, bytes or bytearray, not NoneType
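The failure mode is that `parse_raw_as` (which ultimately calls `json.loads`) is handed `None` when no GPUs were saved for the execution. A minimal sketch of the guard, using `json.loads` in place of pydantic's `parse_raw_as` to stay self-contained; the helper name is an assumption:

```python
import json
from typing import Optional

def load_gpus(raw: Optional[str]) -> list[dict]:
    """Skip parsing when no GPUs were saved (raw is None), since
    json.loads -- like pydantic's parse_raw_as -- raises TypeError
    on a None input, which is the crash in the traceback above."""
    if raw is None:
        return []
    return json.loads(raw)
```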

* Fix packaging issue caused by sevctl
… notify endpoint rejects the VM allocation because of a floating point calculation on the community flow percentage.

Solution: Apply a floor rule on the price calculation, matching the frontend/CLI price calculation.
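A sketch of such a floor rule: round the computed flow price down at a fixed precision so the node-side check cannot reject an allocation over binary floating-point noise. The function name and the 6-decimal precision are assumptions for illustration, not the actual aleph-vm values:

```python
import math

def floor_price(value: float, decimals: int = 6) -> float:
    """Round the computed flow price DOWN at a fixed precision so the
    node-side check agrees with the frontend/CLI calculation instead of
    failing on floating-point representation error."""
    factor = 10 ** decimals
    return math.floor(value * factor) / factor

# Classic example: 0.1 + 0.2 == 0.30000000000000004 in binary floating point
print(floor_price(0.1 + 0.2))  # 0.3
```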
This fixes an issue reported by node owners where, if they enable IPv6 and it is not working, it also breaks the diagnostic /internet check.
Error in CI https://github.com/aleph-im/aleph-vm/actions/runs/13458482852/job/37788353003

 pip3 install --progress-bar off --target ./aleph-vm/opt/aleph-vm/ 'aleph-message==0.6' 'eth-account==0.10' 'sentry-sdk==1.31.0' 'qmp==1.1.0' 'aleph-superfluid~=0.2.1' 'sqlalchemy[asyncio]>=2.0' 'aiosqlite==0.19.0' 'alembic==1.13.1' 'aiohttp_cors==0.7.0' 'pyroute2==0.7.12' 'python-cpuid==0.1.0' 'solathon==1.0.2' 'protobuf==5.28.3'
Collecting aleph-message==0.6
  Downloading aleph_message-0.6.0-py3-none-any.whl (17 kB)
Collecting eth-account==0.10
  Downloading eth_account-0.10.0-py3-none-any.whl (109 kB)
Collecting sentry-sdk==1.31.0
  Downloading sentry_sdk-1.31.0-py2.py3-none-any.whl (224 kB)
Collecting qmp==1.1.0
  Downloading qmp-1.1.0-py3-none-any.whl (11 kB)
Collecting aleph-superfluid~=0.2.1
  Downloading aleph_superfluid-0.2.1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting sqlalchemy[asyncio]>=2.0
  Downloading SQLAlchemy-2.0.38-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting aiosqlite==0.19.0
  Downloading aiosqlite-0.19.0-py3-none-any.whl (15 kB)
Collecting alembic==1.13.1
  Downloading alembic-1.13.1-py3-none-any.whl (233 kB)
Collecting aiohttp_cors==0.7.0
  Downloading aiohttp_cors-0.7.0-py3-none-any.whl (27 kB)
Collecting pyroute2==0.7.12
  Downloading pyroute2-0.7.12-py3-none-any.whl (460 kB)
Collecting python-cpuid==0.1.0
  Downloading python-cpuid-0.1.0.tar.gz (21 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [40 lines of output]

      An error occurred while building the project, please ensure you have the most updated version of setuptools, setuptools_scm and wheel with:
         pip install -U setuptools setuptools_scm wheel

      Traceback (most recent call last):
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/usr/lib/python3/dist-packages/setuptools/build_meta.py", line 162, in get_requires_for_build_wheel
          return self._get_build_requires(
        File "/usr/lib/python3/dist-packages/setuptools/build_meta.py", line 143, in _get_build_requires
          self.run_setup()
        File "/usr/lib/python3/dist-packages/setuptools/build_meta.py", line 158, in run_setup
          exec(compile(code, __file__, 'exec'), locals())
        File "setup.py", line 18, in <module>
          setup(
        File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/usr/lib/python3/dist-packages/setuptools/_distutils/core.py", line 109, in setup
          _setup_distribution = dist = klass(attrs)
        File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 459, in __init__
          _Distribution.__init__(
        File "/usr/lib/python3/dist-packages/setuptools/_distutils/dist.py", line 293, in __init__
          self.finalize_options()
        File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 836, in finalize_options
          for ep in sorted(loaded, key=by_order):
        File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 835, in <lambda>
          loaded = map(lambda e: e.load(), filtered)
        File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2464, in load
          self.require(*args, **kwargs)
        File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2487, in require
          items = working_set.resolve(reqs, env, installer, extras=self.extras)
        File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 782, in resolve
          raise VersionConflict(dist, req).with_context(dependent_req)
      pkg_resources.VersionConflict: (setuptools 59.6.0 (/usr/lib/python3/dist-packages), Requirement.parse('setuptools>=61'))
      [end of output]
Since Python code runs asynchronously in the same process, sharing the global sys.stdout, prints from an
individual call cannot be isolated from other calls.
This was causing an issue when instances failed to start and there was no
persistent VM.

Jira ticket ALEPH-436 (part of ALEPH-436)
Wants=

    Configures (weak) requirement dependencies on other units.

Requires=

    Similar to Wants=, but declares a stronger requirement dependency.

    If this unit gets activated, the units listed will be activated as well. If one of the other units fails to activate, and an ordering dependency After= on the failing unit is set, this unit will not be started. Besides, with or without specifying After=, this unit will be stopped (or restarted) if one of the other units is explicitly stopped (or restarted).

Documentation: https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html
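To illustrate the difference between the two directives quoted above, here is a minimal hypothetical unit fragment; the unit and service names are placeholders, not the actual aleph-vm units:

```ini
# illustrative fragment of a systemd unit file
[Unit]
# Weak dependency: activate ipfs.service with this unit,
# but keep running even if it fails to start
Wants=ipfs.service
# Strong dependency: this unit is stopped/restarted when
# network-online.target is explicitly stopped/restarted
Requires=network-online.target
# Ordering: only start once the listed units have been started
After=network-online.target ipfs.service
```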
Server api1.aleph.im is very slow and outdated
(Core i7 from 2018, up to 40 seconds to respond
to `/metrics` in the monitoring).

We suspect that this causes issues in the
monitoring and performance of the network.

This branch removes all references to api1 and
replaces them with api3 where relevant.

Co-authored-by: Bram <cortex@worlddomination.be>
Co-authored-by: Olivier Le Thanh Duong <olivier@lethanh.be>
Bumps [sentry-sdk](https://github.com/getsentry/sentry-python) from 1.31 to 2.8.0.
- [Release notes](https://github.com/getsentry/sentry-python/releases)
- [Changelog](https://github.com/getsentry/sentry-python/blob/master/CHANGELOG.md)
- [Commits](getsentry/sentry-python@1.31.0...2.8.0)

---
updated-dependencies:
- dependency-name: sentry-sdk
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Fix: Put the correct aleph-message version on the package generation script.
Adding type annotations for better code clarity and safety.
This was done on the pydantic migration branch, but it is added first because it should not
break the code and will facilitate the merge of the pydantic PR.
This bumps their versions to
- Kubo 0.23.0 -> 0.33.1

The list of changes in Kubo is too large to be
mentioned here, but I mostly expect performance
improvements, as the main API has not changed much.
Due to an error reading the denylist.

The error was:
```
Error: constructing the node (see log for full detail):
error walking /home/ipfs/.config/ipfs/denylists:
lstat /home/ipfs/.config/ipfs/denylists: permission denied
```

The [documentation](https://specs.ipfs.tech/compact-denylist-format/)
mentions that:

> Implementations SHOULD look in /etc/ipfs/denylists/ and
> $XDG_CONFIG_HOME/ipfs/denylists/ (default: ~/.config/ipfs/denylists)
> for denylist files.

I am not sure why this only failed on Ubuntu 22.04
and not Debian 12 or Ubuntu 24.04. My first assumption
would be a difference in Systemd.
@codecov

codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.76%. Comparing base (105c7f0) to head (23087df).

Files with missing lines Patch % Lines
src/aleph/vm/orchestrator/utils.py 55.55% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #833      +/-   ##
==========================================
+ Coverage   64.68%   64.76%   +0.08%     
==========================================
  Files          88       88              
  Lines        8160     8169       +9     
  Branches      734      737       +3     
==========================================
+ Hits         5278     5291      +13     
+ Misses       2653     2647       -6     
- Partials      229      231       +2     


@olethanh olethanh force-pushed the ol-aleph-615-executions-resources branch from d76df62 to c515807 on September 18, 2025
@nesitor nesitor self-assigned this Oct 21, 2025
@nesitor nesitor requested a review from odesenfans October 21, 2025 10:54
@nesitor nesitor force-pushed the ol-aleph-615-executions-resources branch from fee8c98 to 3e33743 on October 21, 2025 10:57
@nesitor nesitor force-pushed the ol-aleph-615-executions-resources branch from 6713f9b to 23087df on October 21, 2025 11:34
@nesitor nesitor marked this pull request as ready for review October 21, 2025 14:21
@nesitor nesitor changed the title from "WIP ALEPH-615 Expose resources for executions" to "ALEPH-615 Expose resources for executions" on Oct 21, 2025
if getattr(volume, "size_mib", None):
disk_size_mib += volume.size_mib

return disk_size_mib
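For context, the quoted fragment can be read as part of a function like the following sketch. The function name, the `Volume` type, and the `rootfs_size_mib` parameter are assumptions for illustration; only the `getattr` guard and the accumulation come from the diff above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Volume:
    size_mib: Optional[int] = None

def total_disk_size_mib(volumes: list[Volume], rootfs_size_mib: int = 0) -> int:
    """Sum the declared sizes of all volumes, skipping any volume
    that has no size_mib attribute or a falsy one."""
    disk_size_mib = rootfs_size_mib
    for volume in volumes:
        if getattr(volume, "size_mib", None):
            disk_size_mib += volume.size_mib
    return disk_size_mib
```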
Contributor

Why do we need this if it's coming from the message itself? I don't like that the node can report whatever it wants.

Contributor

Up

Member

Because in the message we don't have the real sizes used by the VMs, such as the volumes, runtime and code. The idea of this PR is to show the real resources used by the VMs. Indeed, the PR needs to be finished by solving the TODO tasks; I have put it up for review just to check the format.

Contributor

From our conversation earlier, the goal of this PR is unclear. Let's not merge it until its purpose is clarified.


9 participants