-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-128213: fast path for bytes creation from list and tuple #128214
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benchmark using python -m timeit -n 100 -s 'a = [40] * 100000' 'bytes(a)' showing a ~81% increase in performance:
Can we have better benchmarks, namely:
- Benchmarks for small lists (< 100 items), medium-sized lists (< 10k items), large lists (>100k). It is important to know how this affect other paths and this should be properly reflected in the NEWS entry.
- Are the benchmarks on a DEBUG or a RELEASE build (possibly PGO/LTO?). Benchmarks on DEBUG builds are not really important, it's better to have a PGO build or at least a release build (-O3)
- Can we check with lists that don't have the same value? namely, use
a = [random.randint(0, 255) for _ in range(N)]
. Please also check that int subclasses are using the fast path.
Objects/bytesobject.c
Outdated
return Py_None; // None as fallback sentinel to the slow path | ||
} | ||
int overflow; | ||
long value = PyLong_AsLongAndOverflow(items[i], &overflow); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're still assuming a long object,. we can just use PyLong_AsInt
instead. We don't care about an overflow as we're only interested in values in [0, 255].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I copied this code from bytearray but you're absolutely right that PyLong_AsInt
is a much better fit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An existing test with -sys.maxsize
failed with PyLong_AsInt
. A fix would be to check for OverflowError
and reraise ValueError
instead, but I think it's easier to simply revert to PyLong_AsLongAndOverflow
and let the range check that follows raise ValueError
when out of the byte range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, we used PyNumber_AsSsize_t
(which invokes _PyNumber_Index
, hence possibly creating side effects). This will be a small behavioral change. While I understand Guido's comment, I'm wondering whether we should keep the old behaviour (though I don't know how it could be useful in production, namely making bytes()
have side-effects on lists if one of their element being converted invokes __index__
).
Note that the current code also prevented crashes by temporarily increfing item
before calling PyNumber_AsSsize_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR is not supposed to be about altering behaviors, I'll keep the current behavior for this PR and file a separate bug against the current behavior then. It should be considered a bug because like Guido pointed out, a list can potentially be mutated inside __index__
, resulting in the item copy loop accessing freed memory. Despite the unlikelihood that such code exists in the real world, it can still potentially happen and cause a crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't have a crash because the list size is checked at every iteration (though what should be checked perhaps is that the list pointer is not NULL). We incref the item before calling __index__
on it so it shouldn't cause crashes. INCREFing before is a trick you also use for use-after-free issues and evil mutations (but we can easily check if this is crashing or not as follows:
class EvilInt:
def __index__(self):
x.clear()
return 0
x = [1,2, EvilInt(), 4]
bytes(x)
and this does not crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh now I see what INCREF and DECREF are doing there. I'll add them back in a bit too then. Thanks.
Right. I've now updated my benchmarks accordingly with a stripped RELEASE build ( |
@@ -0,0 +1,3 @@ | |||
Speed up :class:`bytes` creation from :class:`list` and :class:`tuple` of integers. Benchmarks show that from a list with 1000000 random numbers the time to create a bytes object is reduced by around 31%, or 30% with 10000 numbers, or 27% with 100 numbers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have the pyperf benchmarks on the PR as well? (namely, the nice table with two columns and the diffs as well as the benchmark script? thanks)
@@ -0,0 +1,3 @@ | |||
Speed up :class:`bytes` creation from :class:`list` and :class:`tuple` of integers. Benchmarks show that from a list with 1000000 random numbers the time to create a bytes object is reduced by around 31%, or 30% with 10000 numbers, or 27% with 100 numbers. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, NEWS should not contain an empty line.
Objects/bytesobject.c
Outdated
return Py_None; // None as fallback sentinel to the slow path | ||
} | ||
int overflow; | ||
long value = PyLong_AsLongAndOverflow(items[i], &overflow); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, we used PyNumber_AsSsize_t
(which invokes _PyNumber_Index
, hence possibly creating side effects). This will be a small behavioral change. While I understand Guido's comment, I'm wondering whether we should keep the old behaviour (though I don't know how it could be useful in production, namely making bytes()
have side-effects on lists if one of their element being converted invokes __index__
).
Note that the current code also prevented crashes by temporarily increfing item
before calling PyNumber_AsSsize_t
.
For benchmarks, we prefer having a comparison in terms of mean and standard deviation rather than the best of 5 which could be just "good" data points. As such, it's better to use |
value = PyNumber_AsSsize_t(item, NULL); | ||
if (value == -1 && PyErr_Occurred()) | ||
char *str = PyBytes_AS_STRING(bytes); | ||
PyObject *const *items = PySequence_Fast_ITEMS(x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going for performance, then we can do even better. PySequence_Fast_ITEMS
will call PyList_Check
, but we already know that it's exactly a list or tuple here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But all we know is that it is either a list or a tuple, but we still don't know which of the two it is, so a PyList_Check
call is still in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but PyList_Check
is extra work because that will check for a subclass. It's not really going to be noticable, but it's something to think about :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, you can just duplicate the functions as we did previously. This is not really an issue IMO, and we couold use the fact that tuples are immutables for instance to avoid INCREF/DECREF values.
…sing PyNumber_AsSsize_t; fixed indentation
I see. If it's the norm here then I will propose in Discourse for |
It's more than just including the mean and the standard deviation actually. timeit is not always sufficient for micro-benchmarks like these and does not allow you to calibrate your CPU or compare with reference implementations. Using But yes, |
Oh by the way. I just remember something, but a naive I'm really interested in the pyperf benchmarks because they might differ (maybe the best of 5s are way faster than the average runs) |
if (value == -1 && PyErr_Occurred()) | ||
char *str = PyBytes_AS_STRING(bytes); | ||
PyObject *const *items = PySequence_Fast_ITEMS(x); | ||
Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to acquire the critical section before calling PySequence_Fast_ITEMS
and PySequence_Fast_GET_SIZE
. Attempting to read mutable data without a lock isn't thread safe.
PyObject *const *items = PySequence_Fast_ITEMS(x); | ||
Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(x); | ||
for (Py_ssize_t i = 0; i < size; i++) { | ||
if (!PyLong_Check(items[i])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyLong_CheckExact
probably fits better here, and is faster!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The previous code wasn't doing an exact check so we shouldn't change it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it wouldn't be a breaking change, just a performance loss for the (very niche!) set of cases that use special ints. I'm also slightly worried that non-exact ints might have some nasty side effects that we aren't anticipating here (e.g. can they mess with the Py_ssize_t
value?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can mess with Py_ssize_t
values but only through __index__
and we already check that with PyNumber_AsSsize_t
, but the problem is wider. For example, np.int32()
values can be passed to bytes([...])
but without this check, we will impact performances of numpy-related code which is something we don't want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is casting to bytes a common thing to do with numpy integers, or is that speculation? (I see your point, I'm just sort of gauging what the cost-benefit would be here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say yes, if we're considering serialization or introspection. I imagine that we can have such things when people are working with images because their arrays won't necessarily be pure Python lists but np.ndarray
objects which may have non-primitive data types. I'm not aware of a wild opensource usage though but IMO, since we want to improve performance overall, we shouldn't penalize existing users. If we can slightly decrease our performance gain so to have a stable result, it's good.
Nonetheless, if we want to fallback to a slow path for non-exact ints, benchmarks should show how performances are impacted (a simple timeit
could be sufficient but I'd be more comfortable with a much precise benchmark).
Most changes were addressed (although I'm waiting for a pyperf comparison, but this can wait)
done: | ||
/* some C parsers require a label not to be at the end of a compound | ||
statement, which the ending macro of a critical section introduces, so | ||
we need an empty statement here to satisfy that syntax rule */ | ||
; | ||
/* both success and failure need to end the critical section */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of these goto tricks, consider instead to pull out the critical section code into a helper function; it should result in cleaner and more readable code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid gotos in I defined a helper Py_EXIT_CRITICAL_SECTION
in Include/internal/pycore_critical_section.h to be able to return at errors. Note: the PR has not been merged yet, so I am not sure this is the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, refactoring into a helper would definitely be cleaner and we wouldn't the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I think we could have those macro helpers because they would help us in those situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid gotos in I defined a helper
Py_EXIT_CRITICAL_SECTION
in Include/internal/pycore_critical_section.h to be able to return at errors. Note: the PR has not been merged yet, so I am not sure this is the way to go.
I'm not sure about introducing such an API. IMO, it is better to keep the critical section APIs dead simple. If you need to avoid gotos, refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. With critical sections the safest way to go is to create a foo_lock_held
function, or use AC's @critical_section
if applicable. We really don't want cases where a critical section isn't popped off at the end.
PyObject *const *items = PySequence_Fast_ITEMS(x); | ||
Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyObject *const *items = PySequence_Fast_ITEMS(x); | |
Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(x); | |
Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(x); | |
PyObject *const *items = PySequence_Fast_ITEMS(x); |
Benchmark using
python -m timeit -n 100 -s 'a = __import__("random").choices(range(256), k=1000000)' 'bytes(a)'
showing a ~31% reduction in time consumed:With
k=10000
, a ~30% time reduction:With
k=100
, a ~27% time reduction: