Json uploads (#647)
* Upload JSON or JSON-list files, in addition to CSV and SQLite
* Improvements to user uploads: refactoring, test coverage, better UI, bug fixes, logging
* Clicking a field name in the schema explorer copies it to the clipboard
* Limit charts to 10 numeric series for performance (past 10 they're incomprehensible anyway)
* Limit sampling of uploaded files to 5k rows for the purposes of type inference
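
For context, the first bullet distinguishes two upload shapes: a regular JSON document, parsed in one go, and a JSON-list (JSON-lines) file, parsed one object per line. A minimal sketch with invented payloads:

# Hypothetical payloads illustrating the two JSON shapes this commit adds support for.
import json

standard_json = b'[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'   # one JSON document
json_lines    = b'{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'   # one object per line

records = json.loads(standard_json)                                    # parsed in a single call
line_records = [json.loads(line) for line in json_lines.splitlines()]  # parsed line by line
assert records == line_records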
chrisclark authored Jul 23, 2024
1 parent 4b31630 commit 8972f12
Showing 27 changed files with 1,433 additions and 250 deletions.
23 changes: 11 additions & 12 deletions docs/features.rst
@@ -1,22 +1,20 @@
Features
========

Easy to get started
-------------------
- Built on Django's ORM, so works with MySQL, Postgres, Oracle,
SQLite, Snowflake, MS SQL Server, RedShift, and MariaDB.
- If you want to use Snowflake or SQL Server, you will need to install the relevant package
(e.g. https://pypi.org/project/django-snowflake/, https://github.com/microsoft/mssql-django)
- Small number of dependencies.
- MIT licensed (except for functionality in the /ee/ directory,
which is still free for commercial use, but can't be resold).

SQL Assistant
-------------
- Built in integration with OpenAI (or the LLM of your choosing)
to quickly get help with your query, with relevant schema
automatically injected into the prompt. Simple, effective.

Database Support
----------------
- Supports MySQL, Postgres (and, by extension, pg-connection-compatible DBs like Redshift), SQLite,
Oracle, MS SQL Server, MariaDB, and Snowflake
- Note: for Snowflake or SQL Server, you will need to install the relevant Django connection package
(e.g. https://pypi.org/project/django-snowflake/, https://github.com/microsoft/mssql-django)
- Also supports ad-hoc data sources by uploading JSON, CSV, or SQLite files directly.

Snapshots
---------
- Tick the 'snapshot' box on a query, and Explorer will upload a
@@ -120,7 +118,8 @@ Displaying query results as charts
----------------------------------

If the results table has numeric columns, they can be displayed in a bar chart. The first column will always be used
as the x-axis labels. This is quite basic, but can be useful for quick visualization. Charts (if enabled) will render
for query results with ten or fewer numeric columns. With more series than that, the charts become a hot mess quickly.

To enable this feature, set the ``EXPLORER_CHARTS_ENABLED`` setting to ``True`` and install the plotting library
``matplotlib`` with:
@@ -169,7 +168,7 @@ Multiple Connections
way. See connections.py for more documentation on
multi-connection setup.
- SQL Explorer also supports user-provided connections in the form
of standard database connection details, or uploading CSV, JSON or SQLite
files. See the 'User uploads' section of :doc:`settings`.

Power tips
3 changes: 2 additions & 1 deletion explorer/charts.py
@@ -25,7 +25,8 @@ def get_chart(result: QueryResult, chart_type: str) -> Optional[str]:
c for c in range(1, len(result.data[0]))
if all([isinstance(col[c], (int, float)) or col[c] is None for col in result.data])
]
# Don't create charts for > 10 series. This is a lightweight visualization.
if len(numeric_columns) < 1 or len(numeric_columns) > 10:
return None
labels = [row[0] for row in result.data]
fig, ax = plt.subplots(figsize=(10, 3.8))
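
As a quick illustration of the numeric-column detection and the new 10-series cap above, here is a standalone sketch using made-up result rows (not part of the commit):

# Standalone sketch of the numeric-column filter above, with invented rows (first column = labels).
data = [
    ["2024-01-01", 10, 3.5, "a"],
    ["2024-01-02", None, 4.0, "b"],
]
numeric_columns = [
    c for c in range(1, len(data[0]))
    if all(isinstance(row[c], (int, float)) or row[c] is None for row in data)
]
print(numeric_columns)  # [1, 2] -- column 3 holds strings, so it is excluded
# With the change above, a chart is only drawn when 1 <= len(numeric_columns) <= 10.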
24 changes: 24 additions & 0 deletions explorer/ee/db_connections/create_sqlite.py
@@ -0,0 +1,24 @@
import os
from io import BytesIO

from explorer.ee.db_connections.type_infer import get_parser
from explorer.ee.db_connections.utils import pandas_to_sqlite


def parse_to_sqlite(file) -> (BytesIO, str):
f_name = file.name
f_bytes = file.read()
df_parser = get_parser(file)
if df_parser:
df = df_parser(f_bytes)
try:
f_bytes = pandas_to_sqlite(df, local_path=f"{f_name}_tmp_local.db")
except Exception as e: # noqa
raise ValueError(f"Error while parsing {f_name}: {e}") from e
# replace the previous extension with .db, as it is now a sqlite file
name, _ = os.path.splitext(f_name)
f_name = f"{name}.db"
else:
return BytesIO(f_bytes), f_name # if it's a SQLite file already, simply cough it up as a BytesIO object
return f_bytes, f_name
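
A hedged usage sketch of parse_to_sqlite, assuming pandas is installed and using an invented CSV payload (Django's SimpleUploadedFile stands in for a real upload):

from django.core.files.uploadedfile import SimpleUploadedFile
from explorer.ee.db_connections.create_sqlite import parse_to_sqlite

# Invented CSV content; any in-memory upload with .name, .content_type, .read() and .seek() works here.
upload = SimpleUploadedFile("cities.csv", b"name,population\nOslo,709037\nBergen,285911\n",
                            content_type="text/csv")
db_bytes, db_name = parse_to_sqlite(upload)
print(db_name)  # cities.db -- the extension is swapped for .db once the CSV is converted to SQLite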

54 changes: 54 additions & 0 deletions explorer/ee/db_connections/mime.py
@@ -0,0 +1,54 @@
import csv
import json

# These are 'shallow' checks. They are just to understand if the upload appears valid at surface-level.
# A deeper check will happen when pandas tries to parse the file.
# This is designed to be quick, and simply assigns the right (full) parsing function to the uploaded file.


def is_csv(file):
if file.content_type != "text/csv":
return False
try:
# Check if the file content can be read as a CSV
file.seek(0)
sample = file.read(1024).decode("utf-8")
csv.Sniffer().sniff(sample)
file.seek(0)
return True
except csv.Error:
return False


def is_json(file):
if file.content_type != "application/json":
return False
if not file.name.lower().endswith(".json"):
return False
return True


def is_json_list(file):
if not file.name.lower().endswith(".json"):
return False
file.seek(0)
first_line = file.readline()
file.seek(0)
try:
json.loads(first_line.decode("utf-8"))
return True
except ValueError:
return False


def is_sqlite(file):
if file.content_type != "application/x-sqlite3":
return False
try:
# Check if the file starts with the SQLite file header
file.seek(0)
header = file.read(16)
file.seek(0)
return header == b"SQLite format 3\x00"
except Exception as e: # noqa
return False
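
To show how the shallow JSON checks behave, a small sketch with invented uploads (note that is_json_list keys off whether the first line parses on its own):

from django.core.files.uploadedfile import SimpleUploadedFile
from explorer.ee.db_connections.mime import is_json, is_json_list

pretty = SimpleUploadedFile("events.json", b'{\n  "rows": [1, 2]\n}', content_type="application/json")
jsonl = SimpleUploadedFile("events.json", b'{"id": 1}\n{"id": 2}\n', content_type="application/json")

print(is_json(pretty), is_json_list(pretty))  # True False -- the first line "{" is not valid JSON on its own
print(is_json(jsonl), is_json_list(jsonl))    # True True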
132 changes: 132 additions & 0 deletions explorer/ee/db_connections/type_infer.py
@@ -0,0 +1,132 @@
import io
import json
from explorer.ee.db_connections.mime import is_csv, is_json, is_sqlite, is_json_list


MAX_TYPING_SAMPLE_SIZE = 5000
SHORTEST_PLAUSIBLE_DATE_STRING = 5


def get_parser(file):
if is_csv(file):
return csv_to_typed_df
if is_json_list(file):
return json_list_to_typed_df
if is_json(file):
return json_to_typed_df
if is_sqlite(file):
return None
raise ValueError(f"File {file.content_type} not supported.")


def csv_to_typed_df(csv_bytes, delimiter=",", has_headers=True):
import pandas as pd
csv_file = io.BytesIO(csv_bytes)
df = pd.read_csv(csv_file, sep=delimiter, header=0 if has_headers else None)
return df_to_typed_df(df)


def json_list_to_typed_df(json_bytes):
import pandas as pd
data = []
for line in io.BytesIO(json_bytes).readlines():
data.append(json.loads(line.decode("utf-8")))

df = pd.json_normalize(data)
return df_to_typed_df(df)


def json_to_typed_df(json_bytes):
import pandas as pd
json_file = io.BytesIO(json_bytes)
json_content = json.load(json_file)
df = pd.json_normalize(json_content)
return df_to_typed_df(df)


def atof_custom(value):
# Remove any thousands separators and convert the decimal point
if "," in value and "." in value:
if value.index(",") < value.index("."):
# 0,000.00 format
value = value.replace(",", "")
else:
# 0.000,00 format
value = value.replace(".", "").replace(",", ".")
elif "," in value:
# No decimal point, only thousands separator
value = value.replace(",", "")
return float(value)
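
# Illustrative (hypothetical) inputs for the helper above, not part of the commit:
#   atof_custom("1,234.56") -> 1234.56  (US-style: comma is a thousands separator)
#   atof_custom("1.234,56") -> 1234.56  (European-style: dot is the thousands separator)
#   atof_custom("1,234")    -> 1234.0   (a lone comma is treated as a thousands separator)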



def df_to_typed_df(df): # noqa
import pandas as pd
from dateutil import parser
try:

for column in df.columns:

# If we somehow have an array within a field (e.g. from a json object) then convert it to a string
df[column] = df[column].apply(lambda x: str(x) if isinstance(x, list) else x)

values = df[column].dropna().unique()
if len(values) > MAX_TYPING_SAMPLE_SIZE:
values = pd.Series(values).sample(MAX_TYPING_SAMPLE_SIZE, random_state=42).to_numpy()

is_date = False
is_integer = True
is_float = True

for value in values:
try:
float_val = atof_custom(str(value))
if float_val == int(float_val):
continue # This is effectively an integer
else:
is_integer = False
except ValueError:
is_integer = False
is_float = False
break

if is_integer:
is_float = False

if not is_integer and not is_float:
is_date = True

# The dateutil parser is very aggressive and will interpret many short strings as dates.
# For example "12a" will be interpreted as 12:00 AM on the current date.
# That is not the behavior anyone wants. The shortest plausible date string is e.g. 1-1-23
try_parse = [v for v in values if len(str(v)) > SHORTEST_PLAUSIBLE_DATE_STRING]
if len(try_parse) > 0:
for value in try_parse:
try:
parser.parse(str(value))
except (ValueError, TypeError, OverflowError):
is_date = False
break
else:
is_date = False

if is_date:
df[column] = pd.to_datetime(df[column], errors="coerce", utc=True)
elif is_integer:
df[column] = df[column].apply(lambda x: int(atof_custom(str(x))) if pd.notna(x) else x)
# If there are NaN / blank values, the column will be converted to float
# Convert it back to integer
df[column] = df[column].astype("Int64")
elif is_float:
df[column] = df[column].apply(lambda x: atof_custom(str(x)) if pd.notna(x) else x)
else:
inferred_type = pd.api.types.infer_dtype(values)
if inferred_type == "integer":
df[column] = pd.to_numeric(df[column], errors="coerce", downcast="integer")
elif inferred_type == "floating":
df[column] = pd.to_numeric(df[column], errors="coerce")

return df

except pd.errors.ParserError as e:
return str(e)
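
A hedged end-to-end sketch of the type inference in csv_to_typed_df / df_to_typed_df, assuming pandas and python-dateutil are installed; the CSV content is invented:

from explorer.ee.db_connections.type_infer import csv_to_typed_df

csv_bytes = (
    b"name,signed_up,logins,balance\n"
    b"alice,2024-01-15,12,1234.56\n"
    b"bob,2024-02-03,7,87.10\n"
)
df = csv_to_typed_df(csv_bytes)
print(df.dtypes)
# Roughly expected: name stays object, signed_up becomes a tz-aware datetime,
# logins becomes nullable Int64, balance becomes float64.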
93 changes: 0 additions & 93 deletions explorer/ee/db_connections/utils.py
@@ -108,96 +108,3 @@ def pandas_to_sqlite(df, local_path="local_database.db"):
# Delete the local SQLite database file
# Finally block to ensure we don't litter files around
os.remove(local_path)


MAX_TYPING_SAMPLE_SIZE = 10000
SHORTEST_PLAUSIBLE_DATE_STRING = 5


def atof_custom(value):
# Remove any thousands separators and convert the decimal point
if "," in value and "." in value:
if value.index(",") < value.index("."):
# 0,000.00 format
value = value.replace(",", "")
else:
# 0.000,00 format
value = value.replace(".", "").replace(",", ".")
elif "," in value:
# No decimal point, only thousands separator
value = value.replace(",", "")
return float(value)

def csv_to_typed_df(csv_bytes, delimiter=",", has_headers=True): # noqa
import pandas as pd
from dateutil import parser
try:

csv_file = io.BytesIO(csv_bytes)
df = pd.read_csv(csv_file, sep=delimiter, header=0 if has_headers else None)

for column in df.columns:
values = df[column].dropna().unique()
if len(values) > MAX_TYPING_SAMPLE_SIZE:
values = pd.Series(values).sample(MAX_TYPING_SAMPLE_SIZE, random_state=42).to_numpy()

is_date = False
is_integer = True
is_float = True

for value in values:
try:
float_val = atof_custom(str(value))
if float_val == int(float_val):
continue # This is effectively an integer
else:
is_integer = False
except ValueError:
is_integer = False
is_float = False
break

if is_integer:
is_float = False

if not is_integer and not is_float:
is_date = True

# The dateutil parser is very aggressive and will interpret many short strings as dates.
# For example "12a" will be interpreted as 12:00 AM on the current date.
# That is not the behavior anyone wants. The shortest plausible date string is e.g. 1-1-23
try_parse = [v for v in values if len(str(v)) > SHORTEST_PLAUSIBLE_DATE_STRING]
if len(try_parse) > 0:
for value in try_parse:
try:
parser.parse(str(value))
except (ValueError, TypeError, OverflowError):
is_date = False
break
else:
is_date = False

if is_date:
df[column] = pd.to_datetime(df[column], errors="coerce", utc=True)
elif is_integer:
df[column] = df[column].apply(lambda x: int(atof_custom(str(x))) if pd.notna(x) else x)
# If there are NaN / blank values, the column will be converted to float
# Convert it back to integer
df[column] = df[column].astype("Int64")
elif is_float:
df[column] = df[column].apply(lambda x: atof_custom(str(x)) if pd.notna(x) else x)
else:
inferred_type = pd.api.types.infer_dtype(values)
if inferred_type == "integer":
df[column] = pd.to_numeric(df[column], errors="coerce", downcast="integer")
elif inferred_type == "floating":
df[column] = pd.to_numeric(df[column], errors="coerce")

return df

except pd.errors.ParserError as e:
return str(e)


def is_csv(file):
return file.content_type == "text/csv"