Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MacOS C Reference Implementation #2

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mlmitch
Copy link
Collaborator

@mlmitch mlmitch commented Dec 6, 2024

Basic methods for efficient calculation exposed in header
Use an allocatable data structure as handle so data and resources can be reused
Unit testing implemented with CUnit

Comment on lines 10 to 20
// Microsoft SAL annotations are not available on macOS
// Define them to enable use for documentation purposes
#ifndef _In_
#define _In_
#endif
#ifndef _Out_
#define _Out_
#endif
#ifndef _Inout_
#define _Inout_
#endif

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You look to be using doxygen documentation comments for the API in this file. Would use of https://www.doxygen.nl/manual/commands.html#cmdparam be a better approach then these Windows-specific defines? They aren't the most OS agnostic.

Copy link
Collaborator Author

@mlmitch mlmitch Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm ... coming back to this, its a little bit silly to have these for non-windows since each header file is platform specific

*
* @return 0 on success, -1 on error.
*/
int cpid_make_uuid(_In_ cpid_handle_t library_handle, _In_ uint32_t pid, _In_ uint64_t creation_time_unix_epoch_seconds, _In_ uint32_t creation_time_micros_offset, _Out_ uuid_t uuid);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would pid_t be a more appropriate type for pid? On my machine (and most I believe) pid_t is signed, which makes me think an unsigned type may not be appropriate.

https://www.gnu.org/software/libc/manual/html_node/Process-Identification.html

Copy link
Collaborator Author

@mlmitch mlmitch Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

I take your point about pid being signed. It might be a good idea to change the CPID spec so the pid part of digest inputs is an int64_t instead of uint64_t. This would technically be a breaking change, but no one is using the max range of uint64_t so I think it would be a good change in practice.

Thoughts?

src/constants.h Outdated
#pragma once

#define SHA256_BUFFER_SIZE 32
#define MAX_PID 0x7FFFFFFF
Copy link

@jonahtomrobinson jonahtomrobinson Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is configurable and system dependent, does it make sense to define it ourselves?

On a standard Linux (potentially unix) system the max default is 32768.

Why are we defining this number in hex?

Copy link
Collaborator Author

@mlmitch mlmitch Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With making the PID field pid_t/int64_t, I've removed this check and the OS call will just fail.

src/constants.h Outdated
Comment on lines 6 to 7
#define RETURN_SUCCESS 0
#define RETURN_ERROR -1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally this seems a tad overkill, these are very standard return values with standard meanings.
We've also specified the meaning of the return values explicitly in the API comments

@return 0 on success, -1 on error.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My biggest concern about this implementation is that every function returns only success or failure. If it's a failure, the caller has no way to determine why it has failed. Given that this is a "UNIXy" implementation, I'd have thought that either:

  1. Each function would directly return 0 on success or else an error code from errno.h.
  2. Each function would return success/failure and set errno to an error code from errno.h.

By way of comparison, each function in the Windows implementation returns a standard winerror.h code.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be just return true/false from stdbool and simplify the API if we simply don't care about HOW it failed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got rid of constants. returning 0 and -1 directly

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this empty file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty windows and linuc cmake linux files are just there to establish directory structure

Comment on lines 2 to 12
find_package(OpenSSL REQUIRED)

find_library(IOKIT_LIBRARY IOKit)
if(NOT IOKIT_LIBRARY)
message(FATAL_ERROR "IO Kit not found")
endif()

find_library(COREFOUNDATION_LIBRARY CoreFoundation)
if(NOT COREFOUNDATION_LIBRARY)
message(FATAL_ERROR "Core Foundation not found")
endif()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the README explain how to acquire these dependencies if they are not available by default? Should we be specifying minimum versions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these are part of the base macos installation

#include <openssl/evp.h>

#include "cpid/cpid_macos.h"
#include "../constants.h"
Copy link

@jonahtomrobinson jonahtomrobinson Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it is preferable to add directory one level above to the include search paths or explicitly include the path to files based on a single top level include. Relative search paths like these are prone to breaking and make it harder to integrate a library in my experience.

Copy link
Collaborator Author

@mlmitch mlmitch Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of constants.h so this no longer applies

Comment on lines 26 to 27
char serial_number[MACOS_SERIAL_NUMBER_BUFFER_SIZE]; // 16 bytes
uuid_t hardware_uuid; // 16 bytes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider removing these comments, I don't think they add anything.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think they have a usefulness. The MACOS_EXPECTED_DIGEST_INPUT_CONTENT_SIZE macro is mysteriously set to 88 and having field sizes in trailing comments like this helps explain where that value comes from. I'd actually be inclined to add a similar trailing comment to the other fields in this struct, e.g.

typedef struct {
    char serial_number[MACOS_SERIAL_NUMBER_BUFFER_SIZE]; //   16 bytes
    uuid_t hardware_uuid;                                // + 16 bytes
    process_creation_time_t kernel_task_creation_time;   // + 16 bytes
    process_creation_time_t launchd_creation_time;       // + 16 bytes
    process_creation_time_t process_creation_time;       // + 16 bytes
    uint64_t pid;                                        // +  8 bytes
} digest_input_content_t;                                // = 88 bytes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could go either way for sure. If we keep them then I agree that we should be consistent and add them for all. But equally I would question what value the reader gets from knowing this information and whether we are adding comment management overhead (although admittedly minimal) to changes to this struct.

Copy link
Collaborator Author

@mlmitch mlmitch Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments like dave has. The byte arithmetic for the structure should not change as this is the thing that gets hashed.


int cpid_get_uuid_string(_In_ cpid_handle_t library_handle, _In_ uint32_t pid, _Out_ uuid_string_t uuid_string) {

if (NULL == library_handle || NULL == uuid_string) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the following styling.

Suggested change
if (NULL == library_handle || NULL == uuid_string) {
if (!library_handle || !uuid_string) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. A NULL == foo or foo == NULL test always makes me feel like I'm reading Java or C# 😄

Note that this practice is throughout this code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

return RETURN_ERROR;
}

uuid_t uuid;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most standard c-styling would suggest declaring this at the top of the scope. What styling are you following in this project? It should probably be specified in the README.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno about that tbh...we're not writing K&R C here.

Copy link

@jonahtomrobinson jonahtomrobinson Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a proponent for it, rather I'm just picking up on the fact now as it's something that often comes up once there are more eyes on the code. Simply linking to some public styling guide or tool in the README would put this kind of comment to rest from the start.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What styling are you following in this project?

declare a variable as close as possible to its use

Comment on lines 237 to 241
// set the uuid version (UUIDv8)
library_handle->digest_destination_buffer[6] = (library_handle->digest_destination_buffer[6] & 0x0F) | 0x80;

// set the uuid variant
library_handle->digest_destination_buffer[8] = (library_handle->digest_destination_buffer[8] & 0x3F) | 0x80;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider defining these magic numbers as constants and remove the comments. That why the code is (more) self-documenting.

Comment on lines 208 to 210
if (NULL == library_handle || pid > MAX_PID || creation_time_micros_offset > MAX_MICROS_OFFSET || NULL == uuid) {
return RETURN_ERROR;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd consider breaking these checks up.

First validate the basic null check on pointer params, then perform further validation on the param data itself.

Comment on lines 186 to 187
cpid_finalize(library_handle);
library_handle = NULL;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more flexible to move the NULL set into the cpid_finalize function?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then you'd have to pass the address of that pointer, and that would potentially complicate calling this function from elsewhere. Consider, for example, that in other context where it is called the variable might be const.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal to me either way

Comment on lines 116 to 119
int return_code = sysctl(mib, 4, &process_info, &info_size, NULL, 0);
if (0 != return_code) {
return RETURN_ERROR;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return_code doesn't seem to be used elsewhere and isn't used as a "return_code" in the context of this funciton. I'd suggest removing it.

Suggested change
int return_code = sysctl(mib, 4, &process_info, &info_size, NULL, 0);
if (0 != return_code) {
return RETURN_ERROR;
}
if (0 != sysctl(mib, 4, &process_info, &info_size, NULL, 0)) {
return RETURN_ERROR;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is similar code in other functions in this file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason people traditionally do things like that is that it makes debugging easier because the value returned by the function call isn't lost. In a release build, the temporary variable will be optimised away anyway.

An obvious exception is where the return type of the called function is boolean. In that case, the value is clear from the direction that the if-statement takes 😄

Copy link

@jonahtomrobinson jonahtomrobinson Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I get it will be optimized away. If you're at the point of step-through debugging it is easy enough to break out the information you want if needed. To me defining a variable for it it to be basically unused seems verbose.

Certainly not a deal breaker either way.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty file, should be removed

README.md Outdated
Comment on lines 28 to 34
Compile:
```
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON ..
make
ctest
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining a CMakePresets.txt could potentially simplify these steps.

Copy link

@davemcincork davemcincork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other major comment I have relates to const correctness. But let's discuss offline so that I'm clearer about the intention behind this code.

Comment on lines 26 to 27
char serial_number[MACOS_SERIAL_NUMBER_BUFFER_SIZE]; // 16 bytes
uuid_t hardware_uuid; // 16 bytes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think they have a usefulness. The MACOS_EXPECTED_DIGEST_INPUT_CONTENT_SIZE macro is mysteriously set to 88 and having field sizes in trailing comments like this helps explain where that value comes from. I'd actually be inclined to add a similar trailing comment to the other fields in this struct, e.g.

typedef struct {
    char serial_number[MACOS_SERIAL_NUMBER_BUFFER_SIZE]; //   16 bytes
    uuid_t hardware_uuid;                                // + 16 bytes
    process_creation_time_t kernel_task_creation_time;   // + 16 bytes
    process_creation_time_t launchd_creation_time;       // + 16 bytes
    process_creation_time_t process_creation_time;       // + 16 bytes
    uint64_t pid;                                        // +  8 bytes
} digest_input_content_t;                                // = 88 bytes

Comment on lines 116 to 119
int return_code = sysctl(mib, 4, &process_info, &info_size, NULL, 0);
if (0 != return_code) {
return RETURN_ERROR;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason people traditionally do things like that is that it makes debugging easier because the value returned by the function call isn't lost. In a release build, the temporary variable will be optimised away anyway.

An obvious exception is where the return type of the called function is boolean. In that case, the value is clear from the direction that the if-statement takes 😄

Comment on lines 186 to 187
cpid_finalize(library_handle);
library_handle = NULL;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then you'd have to pass the address of that pointer, and that would potentially complicate calling this function from elsewhere. Consider, for example, that in other context where it is called the variable might be const.

#define _Inout_
#endif

typedef struct cpid_struct *cpid_handle_t;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cpid_handle_t should be an opaque void* as it is in the Windows implementation.


#pragma pack(push, 1)
typedef struct {
uint64_t unix_epoch_seconds;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to include <stdint.h> to get the fixed width integer types? If it's working now, it's most likely just a happy side-effect of including another header.

EVP_MD *sha256;
uint8_t digest_destination_buffer[EVP_MAX_MD_SIZE];
digest_input_content_t digest_input_content;
} *cpid_handle_t;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, I don't think you should declaratively associate cpid_handle_t and cpid_struct.


int cpid_get_uuid_string(_In_ cpid_handle_t library_handle, _In_ uint32_t pid, _Out_ uuid_string_t uuid_string) {

if (NULL == library_handle || NULL == uuid_string) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. A NULL == foo or foo == NULL test always makes me feel like I'm reading Java or C# 😄

Note that this practice is throughout this code.

return RETURN_ERROR;
}

uuid_t uuid;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno about that tbh...we're not writing K&R C here.

src/constants.h Outdated
Comment on lines 6 to 7
#define RETURN_SUCCESS 0
#define RETURN_ERROR -1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My biggest concern about this implementation is that every function returns only success or failure. If it's a failure, the caller has no way to determine why it has failed. Given that this is a "UNIXy" implementation, I'd have thought that either:

  1. Each function would directly return 0 on success or else an error code from errno.h.
  2. Each function would return success/failure and set errno to an error code from errno.h.

By way of comparison, each function in the Windows implementation returns a standard winerror.h code.

Update digest input to use signed integers
Basic methods for efficient calculation exposed in header
Use an allocatable data structure as handle so data and resources can be reused
Unit testing implemented with CUnit
Some const-correctness implemented

Signed-off-by: Mitchell Wasson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants