Description
dlt version
1.16.0
Describe the problem
The oracle NUMBER type is used to represent both integer values as well as floating point values depending on the configuration. See https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i22289 for more details.
SQLAlchemy also models this by subclassing sqlalchemy.dialects.oracle.NUMBER from both sqlalchemy.types.Numeric and sqlalchemy.types.Integer. See https://docs.sqlalchemy.org/en/20/dialects/oracle.html#sqlalchemy.dialects.oracle.NUMBER
dlt assumes that all Oracle NUMBER types are of the Numeric variant and converts them to double or decimal. The relevant code even notes this assumption: https://github.com/dlt-hub/dlt/blob/devel/dlt/sources/sql_database/schema_types.py#L114
Expected behavior
Instead of assuming that all NUMBER types are of the Numeric variant, dlt should check if NUMBER type is either the Numeric variant or the Integer variant.
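A rough sketch of the proposed mapping, in pure Python. The function name and returned type strings are illustrative, not dlt's actual API; the logic mirrors how SQLAlchemy's NUMBER._type_affinity distinguishes the two variants (Integer when scale is 0 or unset, Numeric otherwise):

```python
# Hypothetical helper; names and return values are illustrative only.
def map_oracle_number(precision, scale):
    """Map Oracle NUMBER(precision, scale) to a dlt-style data type.

    NUMBER(17) and NUMBER(17, 0) store integers, so they should become
    bigint; a non-zero scale means fractional digits, so decimal.
    """
    if not scale:  # scale is 0 or None -> integer variant
        return "bigint"
    return "decimal"

print(map_oracle_number(17, 0))  # -> bigint
print(map_oracle_number(10, 2))  # -> decimal
```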
In most cases this isn't a big deal, but some larger values cannot be represented exactly as a double even though they can be represented as a bigint, which becomes a problem during the conversion to Arrow:
>>> import pyarrow as pa
>>> pa.array([9309935020231023], type="int64")
<pyarrow.lib.Int64Array object at 0x7b16939b51e0>
[
  9309935020231023
]
>>> pa.array([9309935020231023], type="float64")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 375, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 45, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 9309935020231023 is outside of the range exactly representable by a IEEE 754 double precision value
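The failure boundary is the 53-bit significand of IEEE 754 doubles: integers above 2**53 are no longer all exactly representable, while int64 covers them up to 2**63 - 1. A quick standard-library-only check:

```python
# IEEE 754 doubles represent every integer exactly only up to 2**53.
v = 9309935020231023        # the value from the traceback above
assert v > 2**53            # past the exact-integer range of a double
assert int(float(v)) != v   # round-tripping through float loses the value
assert v <= 2**63 - 1       # but it fits comfortably in an int64
```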
Steps to reproduce
Successful pipeline:
- Create an oracle table with a column with type NUMBER(17)
- Insert some integers into the column
- Run a pipeline to extract the table using the sql_table resource
- Explore the dlt schema and notice that the column's data type is double
With the pyarrow error on top:
- Create an oracle table with a column with type NUMBER(17)
- Insert some integers into the column, including a very big number like 9309935020231023
- Run a pipeline to extract the table using the sql_table resource with the pyarrow backend
- Exception
Operating system
Linux
Runtime environment
Local
Python version
3.12
dlt data source
sql_database(
    credentials
    schema="my_oracle_schema",
    defer_table_reflect=True,
    chunk_size=10000,
    table_names=["my_oracle_table"],
    backend="pyarrow",
    reflection_level="full_with_precision",
)
dlt destination
No response
Other deployment details
No response
Additional information
The workaround is to use a simple type adapter, but this should work out of the box:

from sqlalchemy.dialects.oracle import NUMBER

def type_adapter_callback(sql_type):
    if isinstance(sql_type, NUMBER):
        # _type_affinity resolves to Integer when scale is 0, Numeric otherwise
        return sql_type._type_affinity
    return sql_type  # Use default detection for other types
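For completeness, the adapter is wired in via the sql_database source's type_adapter_callback argument. This is a configuration sketch only, with placeholder credentials and the adapter repeated for self-containment, not a runnable pipeline:

```python
# Configuration sketch; credentials, schema, and table names are placeholders.
from dlt.sources.sql_database import sql_database
from sqlalchemy.dialects.oracle import NUMBER

def type_adapter_callback(sql_type):
    if isinstance(sql_type, NUMBER):
        return sql_type._type_affinity  # Integer for scale 0, Numeric otherwise
    return sql_type

source = sql_database(
    "oracle+oracledb://user:pass@host:1521/?service_name=svc",  # placeholder
    schema="my_oracle_schema",
    table_names=["my_oracle_table"],
    backend="pyarrow",
    reflection_level="full_with_precision",
    type_adapter_callback=type_adapter_callback,
)
```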