[GPU] paddle.sort and paddle.argsort crash on zero-batch tensors

### Describe the Bug

`paddle.sort` and `paddle.argsort` raise `OSError: (External) CUDA error(9), invalid configuration argument` on GPU when the input tensor has a zero-sized dimension (for example, a batch dimension of 0), even when the sorted axis itself has a positive size. On CPU, `paddle.sort` handles the same empty tensor shape and returns an empty output tensor.

Minimal reproducing example:

```python
import traceback
import paddle

print("paddle:", paddle.__version__)
paddle.device.set_device("gpu:0")

cases = [
    (
        "sort",
        lambda: paddle.sort(
            paddle.empty([0, 3], dtype="int64"),
            axis=1,
            descending=True,
        ),
    ),
    (
        "argsort",
        lambda: paddle.argsort(
            paddle.empty([0, 3, 4], dtype="int32"),
            axis=-2,
            descending=True,
        ),
    ),
]

for name, fn in cases:
    print("running paddle." + name)
    try:
        out = fn()
        print(out.shape, out.dtype, out.place)
    except Exception:
        traceback.print_exc()
```

Expected result:

```text
running paddle.sort
[0, 3] paddle.int64 Place(gpu:0)
running paddle.argsort
[0, 3, 4] paddle.int64 Place(gpu:0)
```

Actual result:

```text
running paddle.sort
Traceback (most recent call last):
  File "<string>", line 25, in <module>
  File ".../site-packages/paddle/tensor/search.py", line 560, in sort
    outs, _ = _C_ops.argsort(x, axis, descending)
OSError: (External) CUDA error(9), invalid configuration argument.
  [Hint: 'cudaErrorInvalidConfiguration'. This indicates that a kernel launch is requesting resources that can never be satisfied by the current device.]
  (at ../paddle/phi/kernels/gpu/argsort_kernel.cu:225)

running paddle.argsort
Traceback (most recent call last):
  File "<string>", line 25, in <module>
  File ".../site-packages/paddle/tensor/search.py", line 103, in argsort
    _, ids = _C_ops.argsort(x, axis, descending)
OSError: (External) CUDA error(9), invalid configuration argument.
  [Hint: 'cudaErrorInvalidConfiguration'. This indicates that a kernel launch is requesting resources that can never be satisfied by the current device.]
  (at ../paddle/phi/kernels/gpu/argsort_kernel.cu:225)
```

For reference, `paddle.sort` succeeds on CPU for the same shape:

```python
import paddle

paddle.device.set_device("cpu")
x = paddle.empty([0, 3], dtype="int64")
out = paddle.sort(x, axis=1, descending=True)
print(out.shape, out.dtype, out.place)
```

```text
[0, 3] paddle.int64 Place(cpu)
```

TensorFlow also handles the corresponding empty `argsort` case on GPU:

```python
import tensorflow as tf

with tf.device("/GPU:0"):
    x = tf.zeros([0, 3, 4], dtype=tf.int32)
    out = tf.argsort(x, axis=-2, direction="DESCENDING")
print(out.shape, out.dtype, out.device)
```

```text
(0, 3, 4) <dtype: 'int32'> /job:localhost/replica:0/task:0/device:GPU:0
```

### Additional Supplementary Information

Reproduced in fresh Python processes with `CUDA_LAUNCH_BLOCKING=1`.

Environment:
  - Python: 3.10.20
  - PaddlePaddle: 2.6.1
  - GPU: NVIDIA GeForce RTX 3090
  - NVIDIA driver: 595.58.03
  - Paddle CUDA runtime: 11.7
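
Until the GPU kernel handles zero-sized inputs, a user-side guard along the following lines may avoid the crash. This is only a sketch and has not been verified against the affected build: the helper names `is_empty_shape`, `sort_allowing_empty`, and `argsort_allowing_empty` are hypothetical and not part of Paddle. It assumes that returning an empty tensor of the input's shape is the desired result, matching the CPU behaviour and the expected output shown above.

```python
import math


def is_empty_shape(shape):
    """Return True when a tensor of this shape holds zero elements."""
    # math.prod([]) is 1, so a 0-d (scalar) shape is not treated as empty.
    return math.prod(shape) == 0


def sort_allowing_empty(x, axis=-1, descending=False):
    """Hypothetical wrapper: skip the sort kernel for zero-element tensors."""
    import paddle  # imported lazily so is_empty_shape works without Paddle

    if is_empty_shape(x.shape):
        # Nothing to sort; return an empty tensor with the same shape and dtype.
        return paddle.empty(x.shape, dtype=x.dtype)
    return paddle.sort(x, axis=axis, descending=descending)


def argsort_allowing_empty(x, axis=-1, descending=False):
    """Hypothetical wrapper: skip the argsort kernel for zero-element tensors."""
    import paddle  # imported lazily so is_empty_shape works without Paddle

    if is_empty_shape(x.shape):
        # Indices are int64, as in the expected result for paddle.argsort.
        return paddle.empty(x.shape, dtype="int64")
    return paddle.argsort(x, axis=axis, descending=descending)
```

With these wrappers, the two failing cases would be called as `sort_allowing_empty(paddle.empty([0, 3], dtype="int64"), axis=1, descending=True)` and `argsort_allowing_empty(paddle.empty([0, 3, 4], dtype="int32"), axis=-2, descending=True)`. The guard only hides the problem on the caller's side; the kernel launch in `argsort_kernel.cu` would still need a fix.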
