
Commit 57eb959

chore: fix typos (#844)

- run [codespell](https://github.com/codespell-project/codespell) on the source code
- change name of parameter in db-benchmark.dockerfile based on spelling suggestion and the documentation: https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/install.packages

1 parent 90f5b5b commit 57eb959

23 files changed: +32 -32 lines changed

benchmarks/db-benchmark/db-benchmark.dockerfile
Lines changed: 1 addition & 1 deletion

@@ -58,7 +58,7 @@ RUN cd pandas && \
 RUN cd modin && \
     virtualenv py-modin --python=/usr/bin/python3.10

-RUN Rscript -e 'install.packages(c("jsonlite","bit64","devtools","rmarkdown"), dependecies=TRUE, repos="https://cloud.r-project.org")'
+RUN Rscript -e 'install.packages(c("jsonlite","bit64","devtools","rmarkdown"), dependencies=TRUE, repos="https://cloud.r-project.org")'

 SHELL ["/bin/bash", "-c"]

docs/mdbook/src/index.md
Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@

 DataFusion is a blazing fast query engine that lets you run data analyses quickly and reliably.

-DataFusion is written in Rust, but also exposes Python and SQL bindings, so you can easily query data in your langauge of choice. You don't need to know any Rust to be a happy and productive user of DataFusion.
+DataFusion is written in Rust, but also exposes Python and SQL bindings, so you can easily query data in your language of choice. You don't need to know any Rust to be a happy and productive user of DataFusion.

 DataFusion lets you run queries faster than pandas. Let's compare query runtimes for a 5GB CSV file with 100 million rows of data.

docs/source/_static/theme_overrides.css
Lines changed: 1 addition & 1 deletion

@@ -56,7 +56,7 @@ a.navbar-brand img {


 /* This is the bootstrap CSS style for "table-striped". Since the theme does
-   not yet provide an easy way to configure this globaly, it easier to simply
+   not yet provide an easy way to configure this globally, it easier to simply
    include this snippet here than updating each table in all rst files to
    add ":class: table-striped" */

docs/source/conf.py
Lines changed: 1 addition & 1 deletion

@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.

-"""Documenation generation."""
+"""Documentation generation."""

 # Configuration file for the Sphinx documentation builder.
 #

docs/source/user-guide/common-operations/expressions.rst
Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ Expressions
 ===========

 In DataFusion an expression is an abstraction that represents a computation.
-Expressions are used as the primary inputs and ouputs for most functions within
+Expressions are used as the primary inputs and outputs for most functions within
 DataFusion. As such, expressions can be combined to create expression trees, a
 concept shared across most compilers and databases.

examples/export.py
Lines changed: 1 addition & 1 deletion

@@ -48,6 +48,6 @@
 pylist = df.to_pylist()
 assert pylist == [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]

-# export to Pyton dictionary of columns
+# export to Python dictionary of columns
 pydict = df.to_pydict()
 assert pydict == {"a": [1, 2, 3], "b": [4, 5, 6]}
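The hunk above converts a DataFrame to rows (`to_pylist`) and to columns (`to_pydict`). The two layouts are mechanical transposes of one another; a minimal pure-Python sketch of that relationship, using the same sample data as the hunk:

```python
# Row-oriented data, as to_pylist would return it.
rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]

# Transpose to column-oriented data, as to_pydict would return it.
cols = {key: [row[key] for row in rows] for key in rows[0]}

print(cols)  # {'a': [1, 2, 3], 'b': [4, 5, 6]}
```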

examples/python-udf-comparisons.py
Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@
 # question "return all of the rows that have a specific combination of these
 # values". We have the combinations we care about provided as a python
 # list of tuples. There is no built in function that supports this operation,
-# but it can be explicilty specified via a single expression or we can
+# but it can be explicitly specified via a single expression or we can
 # use a user defined function.

 ctx = SessionContext()

examples/tpch/q02_minimum_cost_supplier.py
Lines changed: 1 addition & 1 deletion

@@ -96,7 +96,7 @@
 # create a column of that value. We can then filter down any rows for which the cost and
 # minimum do not match.

-# The default window frame as of 5/6/2024 is from unbounded preceeding to the current row.
+# The default window frame as of 5/6/2024 is from unbounded preceding to the current row.
 # We want to evaluate the entire data frame, so we specify this.
 window_frame = datafusion.WindowFrame("rows", None, None)
 df = df.with_column(
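The window-frame distinction in the hunk above can be illustrated without DataFusion at all. A pure-Python sketch (with made-up cost values) of the default frame (unbounded preceding to current row) versus the full frame that `WindowFrame("rows", None, None)` requests:

```python
costs = [7.0, 3.0, 5.0, 2.0, 9.0]

# Default frame: row i sees rows 0..i, so min() is a running minimum.
running_min = [min(costs[: i + 1]) for i in range(len(costs))]

# Full frame over the whole partition: every row sees the global minimum.
full_min = [min(costs)] * len(costs)

print(running_min)  # [7.0, 3.0, 3.0, 2.0, 2.0]
print(full_min)     # [2.0, 2.0, 2.0, 2.0, 2.0]
```

Filtering on "cost equals minimum" only behaves as the query intends with the full frame, which is why the example specifies it explicitly.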

examples/tpch/q04_order_priority_checking.py
Lines changed: 2 additions & 2 deletions

@@ -53,9 +53,9 @@

 # Limit results to cases where commitment date before receipt date
 # Aggregate the results so we only get one row to join with the order table.
-# Alterately, and likely more idomatic is instead of `.aggregate` you could
+# Alternately, and likely more idiomatic is instead of `.aggregate` you could
 # do `.select_columns("l_orderkey").distinct()`. The goal here is to show
-# mulitple examples of how to use Data Fusion.
+# multiple examples of how to use Data Fusion.
 df_lineitem = df_lineitem.filter(col("l_commitdate") < col("l_receiptdate")).aggregate(
     [col("l_orderkey")], []
 )
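The hunk above notes that aggregating on a key with no aggregate expressions and selecting the column then calling distinct produce the same deduplicated keys. A pure-Python sketch of that equivalence (sample order keys are made up):

```python
l_orderkeys = [1, 2, 2, 3, 3, 3]

# "aggregate by key, no aggregates": one entry per distinct key
via_aggregate = sorted({k for k in l_orderkeys})

# "select the column, then distinct": same result
via_distinct = sorted(set(l_orderkeys))

print(via_aggregate)  # [1, 2, 3]
```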

examples/tpch/q06_forecasting_revenue_change.py
Lines changed: 1 addition & 1 deletion

@@ -82,5 +82,5 @@

 revenue = df.collect()[0]["revenue"][0].as_py()

-# Note: the output value from this query may be dependant on the size of the database generated
+# Note: the output value from this query may be dependent on the size of the database generated
 print(f"Potential lost revenue: {revenue:.2f}")

examples/tpch/q07_volume_shipping.py
Lines changed: 1 addition & 1 deletion

@@ -77,7 +77,7 @@
 # the two nations of interest. Since there is no `otherwise()` statement, any values that do
 # not match these will result in a null value and then get filtered out.
 #
-# To do the same using a simle filter would be:
+# To do the same using a simple filter would be:
 # df_nation = df_nation.filter((F.col("n_name") == nation_1) | (F.col("n_name") == nation_2))
 df_nation = df_nation.with_column(
     "n_name",

examples/tpch/q11_important_stock_identification.py
Lines changed: 1 addition & 1 deletion

@@ -63,7 +63,7 @@
 # Compute total value of specific parts
 df = df.aggregate([col("ps_partkey")], [F.sum(col("value")).alias("value")])

-# By default window functions go from unbounded preceeding to current row, but we want
+# By default window functions go from unbounded preceding to current row, but we want
 # to compute this sum across all rows
 window_frame = WindowFrame("rows", None, None)

examples/tpch/q15_top_supplier.py
Lines changed: 1 addition & 1 deletion

@@ -78,7 +78,7 @@
 # from the supplier table
 df = df.join(df_supplier, (["l_suppkey"], ["s_suppkey"]), "inner")

-# Return only the colums requested
+# Return only the columns requested
 df = df.select_columns("s_suppkey", "s_name", "s_address", "s_phone", "total_revenue")

 # If we have more than one, sort by supplier number (suppkey)

examples/tpch/q20_potential_part_promotion.py
Lines changed: 1 addition & 1 deletion

@@ -74,7 +74,7 @@
 # This will filter down the line items to the parts of interest
 df = df.join(df_part, (["l_partkey"], ["p_partkey"]), "inner")

-# Compute the total sold and limit ourselves to indivdual supplier/part combinations
+# Compute the total sold and limit ourselves to individual supplier/part combinations
 df = df.aggregate(
     [col("l_partkey"), col("l_suppkey")], [F.sum(col("l_quantity")).alias("total_sold")]
 )

examples/tpch/q21_suppliers_kept_orders_waiting.py
Lines changed: 1 addition & 1 deletion

@@ -74,7 +74,7 @@
 # only orders where this array is larger than one for multiple supplier orders. The second column
 # is all of the suppliers who failed to make their commitment. We can filter the second column for
 # arrays with size one. That combination will give us orders that had multiple suppliers where only
-# one failed. Use distinct=True in the blow aggregation so we don't get multipe line items from the
+# one failed. Use distinct=True in the blow aggregation so we don't get multiple line items from the
 # same supplier reported in either array.
 df = df.aggregate(
     [col("o_orderkey")],

examples/tpch/q22_global_sales_opportunity.py
Lines changed: 2 additions & 2 deletions

@@ -45,14 +45,14 @@
 # The nation code is a two digit number, but we need to convert it to a string literal
 nation_codes = F.make_array(*[lit(str(n)) for n in NATION_CODES])

-# Use the substring operation to extract the first two charaters of the phone number
+# Use the substring operation to extract the first two characters of the phone number
 df = df_customer.with_column("cntrycode", F.substring(col("c_phone"), lit(0), lit(3)))

 # Limit our search to customers with some balance and in the country code above
 df = df.filter(col("c_acctbal") > lit(0.0))
 df = df.filter(~F.array_position(nation_codes, col("cntrycode")).is_null())

-# Compute the average balance. By default, the window frame is from unbounded preceeding to the
+# Compute the average balance. By default, the window frame is from unbounded preceding to the
 # current row. We want our frame to cover the entire data frame.
 window_frame = WindowFrame("rows", None, None)
 df = df.with_column(
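The extraction-and-filter steps in the hunk above reduce to plain string slicing and membership tests. A pure-Python sketch with made-up phone numbers and balances:

```python
NATION_CODES = ["13", "31"]

customers = [
    {"c_phone": "13-702-555-0101", "c_acctbal": 120.0},
    {"c_phone": "27-918-555-0102", "c_acctbal": 300.0},
    {"c_phone": "31-101-555-0103", "c_acctbal": -5.0},
]

# Equivalent of the substring step: take the first two characters.
for c in customers:
    c["cntrycode"] = c["c_phone"][:2]

# Keep customers with a positive balance whose code is in the list.
selected = [
    c for c in customers
    if c["c_acctbal"] > 0.0 and c["cntrycode"] in NATION_CODES
]

print([c["cntrycode"] for c in selected])  # ['13']
```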

python/datafusion/context.py
Lines changed: 2 additions & 2 deletions

@@ -436,7 +436,7 @@ def __init__(

         Example usage:

-        The following example demostrates how to use the context to execute
+        The following example demonstrates how to use the context to execute
         a query against a CSV data source using the :py:class:`DataFrame` API::

             from datafusion import SessionContext
@@ -853,7 +853,7 @@ def empty_table(self) -> DataFrame:
         return DataFrame(self.ctx.empty_table())

     def session_id(self) -> str:
-        """Retrun an id that uniquely identifies this :py:class:`SessionContext`."""
+        """Return an id that uniquely identifies this :py:class:`SessionContext`."""
         return self.ctx.session_id()

     def read_json(

python/datafusion/expr.py
Lines changed: 1 addition & 1 deletion

@@ -515,7 +515,7 @@ def __init__(

         Args:
             units: Should be one of ``rows``, ``range``, or ``groups``.
-            start_bound: Sets the preceeding bound. Must be >= 0. If none, this
+            start_bound: Sets the preceding bound. Must be >= 0. If none, this
                 will be set to unbounded. If unit type is ``groups``, this
                 parameter must be set.
             end_bound: Sets the following bound. Must be >= 0. If none, this

python/datafusion/functions.py
Lines changed: 5 additions & 5 deletions

@@ -342,7 +342,7 @@ def concat(*args: Expr) -> Expr:
 def concat_ws(separator: str, *args: Expr) -> Expr:
     """Concatenates the list ``args`` with the separator.

-    ``NULL`` arugments are ignored. ``separator`` should not be ``NULL``.
+    ``NULL`` arguments are ignored. ``separator`` should not be ``NULL``.
     """
     args = [arg.expr for arg in args]
     return Expr(f.concat_ws(separator, args))
@@ -541,7 +541,7 @@ def ends_with(arg: Expr, suffix: Expr) -> Expr:


 def exp(arg: Expr) -> Expr:
-    """Returns the exponential of the arugment."""
+    """Returns the exponential of the argument."""
     return Expr(f.exp(arg.expr))


@@ -1593,7 +1593,7 @@ def grouping(arg: Expr, distinct: bool = False) -> Expr:


 def max(arg: Expr, distinct: bool = False) -> Expr:
-    """Returns the maximum value of the arugment."""
+    """Returns the maximum value of the argument."""
     return Expr(f.max(arg.expr, distinct=distinct))


@@ -1769,12 +1769,12 @@ def bit_xor(arg: Expr, distinct: bool = False) -> Expr:


 def bool_and(arg: Expr, distinct: bool = False) -> Expr:
-    """Computes the boolean AND of the arugment."""
+    """Computes the boolean AND of the argument."""
     return Expr(f.bool_and(arg.expr, distinct=distinct))


 def bool_or(arg: Expr, distinct: bool = False) -> Expr:
-    """Computes the boolean OR of the arguement."""
+    """Computes the boolean OR of the argument."""
     return Expr(f.bool_or(arg.expr, distinct=distinct))

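The `concat_ws` docstring corrected in the first hunk above documents a specific semantic: ``NULL`` arguments are skipped rather than nulling the result. A hypothetical pure-Python model of that behavior (not the library's implementation, which delegates to the Rust engine):

```python
def concat_ws_model(separator, *args):
    """Join non-None arguments with separator, mimicking SQL concat_ws."""
    return separator.join(str(a) for a in args if a is not None)

print(concat_ws_model("-", "a", None, "b"))  # a-b
```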

python/datafusion/input/location.py
Lines changed: 2 additions & 2 deletions

@@ -66,7 +66,7 @@ def build_table(
             # Consume header row and count number of rows for statistics.
             # TODO: Possibly makes sense to have the eager number of rows
             # calculated as a configuration since you must read the entire file
-            # to get that information. However, this should only be occuring
+            # to get that information. However, this should only be occurring
             # at table creation time and therefore shouldn't
             # slow down query performance.
             with open(input_file, "r") as file:
@@ -75,7 +75,7 @@ def build_table(
                 print(header_row)
                 for _ in reader:
                     num_rows += 1
-            # TODO: Need to actually consume this row into resonable columns
+            # TODO: Need to actually consume this row into reasonable columns
             raise RuntimeError("TODO: Currently unable to support CSV input files.")
         else:
             raise RuntimeError(

python/datafusion/udf.py
Lines changed: 2 additions & 2 deletions

@@ -153,7 +153,7 @@ def state(self) -> List[pyarrow.Scalar]:

     @abstractmethod
     def update(self, values: pyarrow.Array) -> None:
-        """Evalute an array of values and update state."""
+        """Evaluate an array of values and update state."""
         pass

     @abstractmethod
@@ -189,7 +189,7 @@ def __init__(
     ) -> None:
         """Instantiate a user defined aggregate function (UDAF).

-        See :py:func:`udaf` for a convenience function and arugment
+        See :py:func:`udaf` for a convenience function and argument
         descriptions.
         """
         self._udf = df_internal.AggregateUDF(

src/common/data_type.rs
Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ pub enum RexType {
 /// Arrow types which represents the underlying arrow format
 /// Python types which represent the type in Python
 /// It is important to keep all of those types in a single
-/// and managable location. Therefore this structure exists
+/// and manageable location. Therefore this structure exists
 /// to map those types and provide a simple place for developers
 /// to map types from one system to another.
 #[derive(Debug, Clone)]

src/expr/table_scan.rs
Lines changed: 1 addition & 1 deletion

@@ -94,7 +94,7 @@ impl PyTableScan {

     /// The column indexes that should be. Note if this is empty then
     /// all columns should be read by the `TableProvider`. This function
-    /// provides a Tuple of the (index, column_name) to make things simplier
+    /// provides a Tuple of the (index, column_name) to make things simpler
     /// for the calling code since often times the name is preferred to
     /// the index which is a lower level abstraction.
     #[pyo3(name = "projection")]
