Advanced uploads #651

Merged
merged 2 commits into from
Aug 5, 2024
44 changes: 43 additions & 1 deletion docs/features.rst
@@ -169,7 +169,49 @@ Multiple Connections
multi-connection setup.
- SQL Explorer also supports user-provided connections in the form
of standard database connection details, or uploading CSV, JSON or SQLite
files. See the 'User uploads' section of :doc:`settings`.
files.

File Uploads
Collaborator Author

@marksweb If you have a moment, I'd love for you to give this a read and let me know what you think. I'm not asking for a full code review, just for you to look over the docs and make sure the functionality is clear to you. If there are outstanding questions, I can address them in the docs, or perhaps just by reading the docs you will think of some edge case or test I haven't thought of! I would welcome a brain-dump from you on this functionality.

------------

Upload CSV files, JSON files, or SQLite databases to immediately create connections for querying.

The base name of the file and the ID of the uploader are combined to form the database name, which prevents collisions
when multiple users upload files with the same name. The base name of the file is also used as the table name (e.g.
uploading customers.csv results in a database file named customers_1.db, with a table named 'customers').
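
The naming rule can be sketched as follows. This is a minimal illustration based on the `get_names` function in this PR; `derive_names` is a hypothetical name, and filename sanitisation (done by `secure_filename` in the project) is left out.

```python
import os


def derive_names(upload_filename, user_id, append_db_name=None):
    """Return (table_name, db_filename) for an uploaded file.

    Hypothetical helper; it mirrors the documented rule only.
    """
    table_name, _ = os.path.splitext(os.path.basename(upload_filename))
    if append_db_name:
        # Appending re-uses the name of the existing database file.
        db_filename = os.path.basename(append_db_name)
    else:
        # New uploads get "<base name>_<uploader id>.db".
        db_filename = f"{table_name}_{user_id}.db"
    return table_name, db_filename


print(derive_names("customers.csv", 1))
print(derive_names("orders.csv", 1, append_db_name="customers_1.db"))
```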

You can also append uploaded files to previously uploaded data sources. For example, if you have a
'customers.csv' file and an 'orders.csv' file, you can upload customers.csv to create a new data source, then
upload orders.csv with the 'Append' drop-down set to the newly created customers database. The result is a single
SQLite database connection with both tables available to be queried together. If you later upload a new
'orders.csv' and append it to customers, the table 'orders' is *fully replaced* with the contents of the new file.
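
The append behaviour can be sketched with the standard library alone. The project writes tables through Pandas with `if_exists="replace"`; here a hypothetical `write_table` helper stands in for that by dropping and recreating the table, and an in-memory database stands in for customers_1.db.

```python
import sqlite3


def write_table(conn, table_name, header, rows):
    """Create table_name from rows, replacing any existing table of that name."""
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
    conn.execute(f'CREATE TABLE "{table_name}" ({cols})')
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for customers_1.db
write_table(conn, "customers", ["id", "name"], [(1, "Ada"), (2, "Grace")])
write_table(conn, "orders", ["id", "customer_id"], [(10, 1), (11, 2)])  # append
write_table(conn, "orders", ["id", "customer_id"], [(12, 1)])  # re-upload replaces

tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
joined = conn.execute(
    "SELECT c.name, o.id FROM customers c JOIN orders o ON o.customer_id = c.id"
).fetchall()
print(tables, joined)
```

After the second orders upload, only the rows from the newest file remain in 'orders', while 'customers' is untouched.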

**How it works**

1. Your file is uploaded to the web server. For CSV files, the first row is assumed to be a header.
2. It is read into a Pandas dataframe. At this point, many fields that are in fact numbers or datetimes are still stored as strings.
3. If it is a JSON file, the JSON is 'normalized' while it is read, e.g. nested objects are flattened.
4. A custom parser runs type detection on each column for richer type information.
5. The dataframe is coerced to these more accurate types.
6. The dataframe is written to a SQLite file, which is kept on the server and also uploaded to S3.
7. The SQLite database is added as a new connection to SQL Explorer and is available for querying, just like any
other data source.
8. If the SQLite file is not available locally, it will be pulled on-demand from S3 when needed.
9. Local SQLite files are periodically cleaned up by a recurring task after (by default) 7 days of inactivity.

Note that if the upload is a SQLite database, steps 2-5 are skipped and the database is simply uploaded to S3 and made
available for querying.
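
The type-detection idea in steps 4 and 5 can be sketched as below. This is not the project's parser: `infer_column_type` is a hypothetical, standard-library-only illustration that tries progressively looser types on a column of strings and falls back to text.

```python
from datetime import datetime


def infer_column_type(values):
    """Guess a type name for a column of strings; empty strings are ignored."""
    present = [v for v in values if v != ""]
    if not present:
        return "text"
    candidates = (
        ("integer", int),
        ("float", float),
        ("datetime", datetime.fromisoformat),
    )
    for type_name, caster in candidates:
        try:
            for v in present:
                caster(v)  # raises ValueError if the value does not fit
        except ValueError:
            continue  # try the next, looser candidate
        return type_name
    return "text"


print(infer_column_type(["1", "2", ""]))
print(infer_column_type(["1.5", "2"]))
print(infer_column_type(["2024-08-05", "2024-08-06"]))
print(infer_column_type(["a", "1"]))
```

A real parser would likely sample rows and accept more date formats; the point here is only that a column is promoted when every non-empty value fits the candidate type.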

**File formats**

- Supports well-formed .csv and .json files. Also supports .json files in which each line is a separate JSON
object. See /explorer/tests/json/ in the source for examples of what is supported.
- Supports SQLite files with a .db or .sqlite extension. The validity of the SQLite file is not fully checked until
a query is attempted.
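
As a rough illustration of the line-delimited JSON case and of what 'normalized' means, the sketch below parses one object per line and flattens nested objects. The helper names are hypothetical, and joining nested keys with '.' is an assumption about the flattening convention rather than something this PR states.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with '.'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def read_json_lines(text):
    """Parse text in which each non-empty line is a separate JSON object."""
    return [flatten(json.loads(line)) for line in text.splitlines() if line.strip()]


sample = (
    '{"id": 1, "address": {"city": "Oslo"}}\n'
    '{"id": 2, "address": {"city": "Lima"}}\n'
)
rows = read_json_lines(sample)
print(rows)
```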

**Configuration**

- See the 'User uploads' section of :doc:`settings` for configuration details.

Power tips
----------
4 changes: 2 additions & 2 deletions docs/settings.rst
@@ -383,7 +383,7 @@ User Uploads
With `EXPLORER_DB_CONNECTIONS_ENABLED` set to `True`, you can also set `EXPLORER_USER_UPLOADS_ENABLED` to allow users
to upload their own CSV and SQLite files directly to explorer as new connections.

Go to connections->Add New and scroll down to see the upload interface. The uploaded files are limited in size by the
Go to connections->Upload File. The uploaded files are limited in size by the
`EXPLORER_MAX_UPLOAD_SIZE` setting which is set to 500mb by default (500 * 1024 * 1024). SQLite files (in either .db or
.sqlite) will simple appear as connections. CSV files get run through a parser that infers the type of each field.
.sqlite) will simply appear as connections. CSV files get run through a parser that infers the type of each field.
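
A settings fragment matching the description above might look like this. The three setting names come from the docs; `within_upload_limit` is a hypothetical helper added only to show the arithmetic, and whether the project's own check is inclusive of the limit is not stated here.

```python
# settings.py (sketch)
EXPLORER_DB_CONNECTIONS_ENABLED = True
EXPLORER_USER_UPLOADS_ENABLED = True
EXPLORER_MAX_UPLOAD_SIZE = 500 * 1024 * 1024  # bytes; the documented default


def within_upload_limit(size_in_bytes, limit=EXPLORER_MAX_UPLOAD_SIZE):
    """Hypothetical helper: True if a file of this size fits under the limit."""
    return size_in_bytes <= limit


print(EXPLORER_MAX_UPLOAD_SIZE)
```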

2 changes: 1 addition & 1 deletion explorer/charts.py
@@ -34,7 +34,7 @@ def get_chart(result: QueryResult, chart_type: str) -> Optional[str]:
bar_positions = []
for idx, col_num in enumerate(numeric_columns):
if chart_type == "bar":
values = [row[col_num] for row in result.data]
values = [row[col_num] if row[col_num] is not None else 0 for row in result.data]
bar_container = ax.bar([x + idx * BAR_WIDTH
for x in range(len(labels))], values, BAR_WIDTH, label=result.headers[col_num])
bars.append(bar_container)
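
The effect of the changed comprehension can be seen in isolation. `bar_values` is a hypothetical wrapper around the same expression, and the sample rows are made up.

```python
def bar_values(rows, col_num):
    """Replace missing (None) cells with 0 so they can be plotted as bar heights."""
    return [row[col_num] if row[col_num] is not None else 0 for row in rows]


rows = [("a", 3), ("b", None), ("c", 5)]
print(bar_values(rows, 1))
```

One consequence of this choice is that a missing value and a genuine zero draw the same zero-height bar.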
41 changes: 30 additions & 11 deletions explorer/ee/db_connections/create_sqlite.py
@@ -1,24 +1,43 @@
import os
from io import BytesIO

from explorer.utils import secure_filename
from explorer.ee.db_connections.type_infer import get_parser
from explorer.ee.db_connections.utils import pandas_to_sqlite
from explorer.ee.db_connections.utils import pandas_to_sqlite, uploaded_db_local_path


def parse_to_sqlite(file) -> (BytesIO, str):
f_name = file.name
f_bytes = file.read()
def get_names(file, append_conn=None, user_id=None):
    s_filename = secure_filename(file.name)
    table_name, _ = os.path.splitext(s_filename)

    # f_name represents the filename of both the sqlite DB on S3, and on the local filesystem.
    # If we are appending to an existing data source, then we re-use the same name.
    # New connections get a new database name.
    if append_conn:
        f_name = os.path.basename(append_conn.name)
    else:
        f_name = f"{table_name}_{user_id}.db"

    return table_name, f_name


def parse_to_sqlite(file, append_conn=None, user_id=None) -> (BytesIO, str):

table_name, f_name = get_names(file, append_conn, user_id)

# When appending, make sure the database exists locally so that we can write to it
if append_conn:
append_conn.download_sqlite_if_needed()

df_parser = get_parser(file)
if df_parser:
df = df_parser(f_bytes)
try:
f_bytes = pandas_to_sqlite(df, local_path=f"{f_name}_tmp_local.db")
df = df_parser(file.read())
local_path = uploaded_db_local_path(f_name)
f_bytes = pandas_to_sqlite(df, table_name, local_path)
except Exception as e: # noqa
raise ValueError(f"Error while parsing {f_name}: {e}") from e
# replace the previous extension with .db, as it is now a sqlite file
name, _ = os.path.splitext(f_name)
f_name = f"{name}.db"
else:
return BytesIO(f_bytes), f_name # if it's a SQLite file already, simply cough it up as a BytesIO object
# If it's a SQLite file already, simply cough it up as a BytesIO object
return BytesIO(file.read()), f_name
return f_bytes, f_name

2 changes: 1 addition & 1 deletion explorer/ee/db_connections/mime.py
@@ -42,7 +42,7 @@ def is_json_list(file):


def is_sqlite(file):
if file.content_type != "application/x-sqlite3":
if file.content_type not in ["application/x-sqlite3", "application/octet-stream"]:
return False
try:
# Check if the file starts with the SQLite file header
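
Because browsers often report SQLite files as the generic `application/octet-stream`, the header check carries most of the weight. A minimal sketch of such a check, assuming a hypothetical `has_sqlite_header` helper that reads from a path rather than an uploaded file object: every SQLite 3 database file begins with the 16-byte string `SQLite format 3` followed by a NUL byte.

```python
import os
import sqlite3
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of a SQLite 3 database file


def has_sqlite_header(path):
    """Return True if the file at path starts with the SQLite 3 magic string."""
    with open(path, "rb") as f:
        return f.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # A real database created by the sqlite3 module.
    db_path = os.path.join(tmp, "example.db")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE t (x)")
    conn.commit()
    conn.close()

    # A CSV file that merely carries a .db extension.
    fake_path = os.path.join(tmp, "not_a_db.db")
    with open(fake_path, "wb") as f:
        f.write(b"id,name\n1,Ada\n")

    real_ok = has_sqlite_header(db_path)
    fake_ok = has_sqlite_header(fake_path)

print(real_ok, fake_ok)
```

A matching header does not prove the rest of the file is intact, which is consistent with the docs' note that validity is not fully checked until a query is attempted.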
31 changes: 28 additions & 3 deletions explorer/ee/db_connections/models.py
@@ -1,11 +1,10 @@
import os

from django.conf import settings
from django.core.exceptions import ValidationError
from django.db import models
from django.db.models.signals import pre_save
from django.dispatch import receiver
from explorer.ee.db_connections.utils import user_dbs_local_dir
from explorer.ee.db_connections.utils import uploaded_db_local_path, quick_hash

from django_cryptography.fields import encrypt

@@ -33,18 +32,44 @@ class DatabaseConnection(models.Model):
host = encrypt(models.CharField(max_length=255, blank=True))
port = models.CharField(max_length=255, blank=True)
extras = models.JSONField(blank=True, null=True)
upload_fingerprint = models.CharField(max_length=255, blank=True, null=True)

def __str__(self):
return f"{self.name} ({self.alias})"

def update_fingerprint(self):
self.upload_fingerprint = self.local_fingerprint()
self.save()

def local_fingerprint(self):
if os.path.exists(self.local_name):
return quick_hash(self.local_name)

def _download_sqlite(self):
from explorer.utils import get_s3_bucket
s3 = get_s3_bucket()
s3.download_file(self.host, self.local_name)

def download_sqlite_if_needed(self):
download = not os.path.exists(self.local_name) or self.local_fingerprint() != self.upload_fingerprint

if download:
self._download_sqlite()
self.update_fingerprint()


@property
def is_upload(self):
return self.engine == self.SQLITE and self.host

@property
def local_name(self):
if self.is_upload:
return os.path.join(user_dbs_local_dir(), self.name)
return uploaded_db_local_path(self.name)

def delete_local_sqlite(self):
if self.is_upload and os.path.exists(self.local_name):
os.remove(self.local_name)

@classmethod
def from_django_connection(cls, connection_alias):
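
The decision made in `download_sqlite_if_needed` can be sketched outside the model. This is an illustration under assumptions: `needs_download` and `whole_file_hash` are hypothetical names, and a whole-file SHA-256 stands in for the project's sampled `quick_hash`.

```python
import hashlib
import os
import tempfile


def whole_file_hash(path):
    """Stand-in fingerprint: SHA-256 of the entire file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_download(local_path, stored_fingerprint, fingerprint_fn=whole_file_hash):
    """Download if the local copy is missing or differs from the stored fingerprint."""
    if not os.path.exists(local_path):
        return True
    return fingerprint_fn(local_path) != stored_fingerprint


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "customers_1.db")
    missing = needs_download(path, stored_fingerprint="abc")

    with open(path, "wb") as f:
        f.write(b"version 1")
    stored = whole_file_hash(path)
    fresh = needs_download(path, stored)

    with open(path, "wb") as f:
        f.write(b"version 2")  # local copy no longer matches what was recorded
    stale = needs_download(path, stored)

print(missing, fresh, stale)
```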
81 changes: 50 additions & 31 deletions explorer/ee/db_connections/utils.py
@@ -2,7 +2,7 @@
from django.db.utils import load_backend
import os
import json

import hashlib
import sqlite3
import io

@@ -21,29 +21,23 @@ def upload_sqlite(db_bytes, path):
# to this new database connection. Oops!
# TODO: In the future, queries should probably be FK'ed to the ID of the connection, rather than simply
# storing the alias of the connection as a string.
def create_connection_for_uploaded_sqlite(filename, user_id, s3_path):
def create_connection_for_uploaded_sqlite(filename, s3_path):
from explorer.models import DatabaseConnection
base, ext = os.path.splitext(filename)
filename = f"{base}_{user_id}{ext}"
return DatabaseConnection.objects.create(
alias=f"{filename}",
alias=filename,
engine=DatabaseConnection.SQLITE,
name=filename,
host=s3_path
host=s3_path,
)


def get_sqlite_for_connection(explorer_connection):
from explorer.utils import get_s3_bucket

# Get the database from s3, then modify the connection to work with the downloaded file.
# E.g. "host" should not be set, and we need to get the full path to the file
local_name = explorer_connection.local_name
if not os.path.exists(local_name):
s3 = get_s3_bucket()
s3.download_file(explorer_connection.host, local_name)
explorer_connection.download_sqlite_if_needed()
# Note the order here is important; .local_name checked "is_upload" which relies on .host being set
explorer_connection.name = explorer_connection.local_name
explorer_connection.host = None
explorer_connection.name = local_name
return explorer_connection
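
The ordering note in `get_sqlite_for_connection` can be demonstrated with a small stand-in class. `FakeConnection` is hypothetical and simplifies `is_upload` to "host is set" (the model also checks the engine); the `user_dbs` directory name is made up.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeConnection:
    name: str
    host: Optional[str]

    @property
    def is_upload(self):
        return bool(self.host)

    @property
    def local_name(self):
        # Like the model property, this yields None when the connection is not an upload.
        if self.is_upload:
            return os.path.join("user_dbs", self.name)
        return None


# Correct order: read local_name while host is still set, then clear host.
right = FakeConnection(name="customers_1.db", host="s3/customers_1.db")
right.name = right.local_name
right.host = None

# Wrong order: clearing host first makes local_name return None.
wrong = FakeConnection(name="customers_1.db", host="s3/customers_1.db")
wrong.host = None
wrong.name = wrong.local_name

print(right.name, wrong.name)
```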


@@ -54,6 +48,10 @@ def user_dbs_local_dir():
return d


def uploaded_db_local_path(name):
return os.path.join(user_dbs_local_dir(), name)


def create_django_style_connection(explorer_connection):

if explorer_connection.is_upload:
@@ -87,24 +85,45 @@ def create_django_style_connection(explorer_connection):
raise DatabaseError(f"Failed to create explorer connection: {e}") from e


def pandas_to_sqlite(df, local_path="local_database.db"):
# Write the DataFrame to a local SQLite database
# In theory, it would be nice to write the dataframe to an in-memory SQLite DB, and then dump the bytes from that
# but there is no way to get to the underlying bytes from an in-memory SQLite DB
con = sqlite3.connect(local_path)
try:
df.to_sql(name="data", con=con, if_exists="replace", index=False)
finally:
con.close()
def sqlite_to_bytesio(local_path):
# Write the file to disk. It'll be uploaded to s3, and left here locally for querying
db_file = io.BytesIO()
with open(local_path, "rb") as f:
db_file.write(f.read())
db_file.seek(0)
return db_file


def pandas_to_sqlite(df, table_name, local_path):
# Write the DataFrame to a local SQLite database and return it as a BytesIO object.
# This intentionally leaves the sqlite db on the local disk so that it is ready to go for
# querying immediately after the connection has been created. Removing it would also be OK, since
# the system knows to re-download it if it's not available, but this saves an extra download from S3.
conn = sqlite3.connect(local_path)

# Read the local SQLite database file into a BytesIO buffer
try:
db_file = io.BytesIO()
with open(local_path, "rb") as f:
db_file.write(f.read())
db_file.seek(0)
return db_file
df.to_sql(table_name, conn, if_exists="replace", index=False)
finally:
# Delete the local SQLite database file
# Finally block to ensure we don't litter files around
os.remove(local_path)
conn.commit()
conn.close()

return sqlite_to_bytesio(local_path)


def quick_hash(file_path, num_samples=10, sample_size=1024):
    hasher = hashlib.sha256()
    file_size = os.path.getsize(file_path)

    if file_size == 0:
        return hasher.hexdigest()

    sample_interval = file_size // num_samples
    with open(file_path, "rb") as f:
        for i in range(num_samples):
            f.seek(i * sample_interval)
            sample_data = f.read(sample_size)
            if not sample_data:
                break
            hasher.update(sample_data)

    return hasher.hexdigest()
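
Since `quick_hash` reads only a few fixed-size samples, it is fast on large files but cannot see every change. The sketch below repeats the function from this diff so it runs on its own, then edits one byte between samples and one byte inside the first sample; the file contents and the `hash_with_byte_changed` helper are made up for the demonstration.

```python
import hashlib
import os
import tempfile


def quick_hash(file_path, num_samples=10, sample_size=1024):
    # Same sampling scheme as in the diff above.
    hasher = hashlib.sha256()
    file_size = os.path.getsize(file_path)
    if file_size == 0:
        return hasher.hexdigest()
    sample_interval = file_size // num_samples
    with open(file_path, "rb") as f:
        for i in range(num_samples):
            f.seek(i * sample_interval)
            sample_data = f.read(sample_size)
            if not sample_data:
                break
            hasher.update(sample_data)
    return hasher.hexdigest()


def hash_with_byte_changed(path, offset):
    """Overwrite one byte at offset, then return the new quick_hash."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"\xff")
    return quick_hash(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.db")
    with open(path, "wb") as f:
        f.write(bytes(100_000))  # 100,000 zero bytes; samples start every 10,000 bytes

    baseline = quick_hash(path)
    unsampled_change = hash_with_byte_changed(path, 5_000)  # falls between samples
    sampled_change = hash_with_byte_changed(path, 10)  # falls inside the first sample

print(baseline == unsampled_change, baseline == sampled_change)
```

With these defaults only about 10 KB of the file is read, so the fingerprint is best treated as a cheap staleness hint rather than an integrity check.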