documentation, cleaner prompts

explorerhq · Aug 27, 2024 · 2ed69f6
1 parent 6586b63
Showing 5 changed files with 85 additions and 44 deletions.
diff --git a/HISTORY.rst b/HISTORY.rst
@@ -7,13 +7,40 @@ This project adheres to `Semantic Versioning <https://semver.org/>`_.
 
 vNext
 ===========================
-* `#660`_: Userspace connection migration. This should be an invisible change, but represents a significant refactor of how connections function.
-Instead of a weird blend of DatabaseConnection models and underlying Django models (which were the original Explorer connections),
-this migrates all connections to DatabaseConnection models and implements proper foreign keys to them on the Query and QueryLog models.
-A data migration creates new DatabaseConnection models based on the configured settings.EXPLORER_CONNECTIONS.
-Going forward, admins can create new Django-backed DatabaseConnection models by registering the connection in EXPLORER_CONNECTIONS, and then creating a
-DatabaseConnection model using the Django admin or the user-facing /connections/new/ form, and entering the Django DB alias and setting the connection type to "Django Connection"
-
+* Keyboard shortcut for formatting the SQL in the editor.
+
+  - Cmd+Shift+F (Windows: Ctrl+Shift+F)
+  - The format button is now a small icon at the bottom-right of the SQL editor.
+
+* `#664`_: Improvements to the AI SQL Assistant:
+
+  - Table Annotations: Write persistent table annotations with descriptive information that will get injected into the
+    prompt for the assistant. For example, if a table is commonly joined to another table through a non-obvious foreign
+    key, you can tell the assistant about it in plain English, as an annotation to that table. Every time that table is
+    deemed 'relevant' to an assistant request, that annotation will be included alongside the schema and sample data.
+  - Few-Shot Examples: Using the small checkbox on the bottom-right of any saved query, you can designate certain
+    queries as 'few-shot examples'. When making an assistant request, any designated few-shot examples that reference
+    the same tables as your assistant request will get included as 'reference sql' in the prompt for the LLM.
+  - Autocomplete / multiselect when selecting table info to send to the SQL Assistant. Much easier and more
+    keyboard-focused.
+  - Relevant tables are added client-side visually, in real time, based on what's in the SQL editor. The dependency on
+    sql_metadata is therefore removed, as server-side SQL parsing is no longer necessary.
+  - Improved system prompt that emphasizes the particular SQL dialect being used.
+  - Addresses issue #657.
+
+* `#660`_: Userspace connection migration.
+
+  - This should be an invisible change, but represents a significant refactor of how connections function. Instead of a
+    weird blend of DatabaseConnection models and underlying Django models (which were the original Explorer
+    connections), this migrates all connections to DatabaseConnection models and implements proper foreign keys to them
+    on the Query and QueryLog models. A data migration creates new DatabaseConnection models based on the configured
+    settings.EXPLORER_CONNECTIONS. Going forward, admins can create new Django-backed DatabaseConnection models by
+    registering the connection in EXPLORER_CONNECTIONS, and then creating a DatabaseConnection model using the Django
+    admin or the user-facing /connections/new/ form, and entering the Django DB alias and setting the connection type
+    to "Django Connection".
+  - The Query.connection and QueryLog.connection fields are deprecated and will be removed in a future release. They
+    are kept around in this release in case there is an unforeseen issue with the migration. Preserving the fields for
+    now ensures there is no data loss in the event that a rollback to an earlier version is required.
 
 `5.2.0`_ (2024-08-19)
 ===========================

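To make the registration step in the `#660` entry concrete, here is a hedged sketch of a Django settings fragment. Only `EXPLORER_CONNECTIONS` is named in the changelog above. The `readonly` alias, the `Reporting` friendly name, the SQLite file names, and the mapping direction (friendly name to Django DB alias) are assumptions for illustration.

```python
# settings.py (fragment): a sketch, not the project's canonical configuration.

# Standard Django database definitions. The "readonly" alias and the file
# names are hypothetical placeholders.
DATABASES = {
    "default": {"ENGINE": "django.db.backends.sqlite3", "NAME": "app.sqlite3"},
    "readonly": {"ENGINE": "django.db.backends.sqlite3", "NAME": "reporting.sqlite3"},
}

# Assumed shape: friendly name -> Django DB alias.
EXPLORER_CONNECTIONS = {"Reporting": "readonly"}

# Sanity check: every registered alias must exist in DATABASES.
missing = [alias for alias in EXPLORER_CONNECTIONS.values() if alias not in DATABASES]
assert not missing, f"Unknown database aliases: {missing}"
```

With the alias registered, the remaining step described in the changelog is to create a DatabaseConnection (through the Django admin or the /connections/new/ form) that points at that alias with the connection type set to "Django Connection".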
diff --git a/docs/features.rst b/docs/features.rst
@@ -5,7 +5,19 @@ SQL Assistant
 -------------
 - Built in integration with OpenAI (or the LLM of your choosing)
   to quickly get help with your query, with relevant schema
-  automatically injected into the prompt. Simple, effective.
+  automatically injected into the prompt.
+- The assistant tries hard to get relevant context into the prompt to the LLM, alongside your explicit request. You
+  can choose tables to include explicitly, and any tables you reference in your SQL are included automatically as
+  well. When a table is "included", the prompt will include the schema of the table, 3 sample rows, any Table
+  Annotations you have added, and any designated "few shot examples". More on each of those below.
+- Table Annotations: Write persistent table annotations with descriptive information that will get injected into the
+  prompt for the assistant. For example, if a table is commonly joined to another table through a non-obvious foreign
+  key, you can tell the assistant about it in plain English, as an annotation to that table. Every time that table is
+  deemed 'relevant' to an assistant request, that annotation will be included alongside the schema and sample data.
+- Few-shot examples: Using the small checkbox on the bottom-right of any saved query, you can designate queries as
+  "Assistant Examples". When making an assistant request, the 'included tables' are intersected with the tables
+  referenced by designated Example queries. Matching queries are injected into the prompt, and the LLM is told that
+  these are good reference queries.
 
 Database Support
 ----------------
@@ -222,8 +234,7 @@ Power tips
   view.
 - Command+Enter and Ctrl+Enter will execute a query when typing in
   the SQL editor area.
-- Hit the "Format" button to format and clean up your SQL (this is
-  non-validating -- just formatting).
+- Cmd+Shift+F (Windows: Ctrl+Shift+F) to format the SQL in the editor.
 - Use the Query Logs feature to share one-time queries that aren't
   worth creating a persistent query for. Just run your SQL in the
   playground, then navigate to ``/logs`` and share the link

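The few-shot selection described in the SQL Assistant section of docs/features.rst can be pictured with a small, framework-free sketch. `SavedQuery` and `relevant_examples` are hypothetical names, and the case-insensitive substring match is an assumption made for illustration; the diff does not show how Explorer actually matches queries to tables.

```python
from dataclasses import dataclass


@dataclass
class SavedQuery:
    """Hypothetical stand-in for a saved query; not the Explorer model."""
    title: str
    sql: str
    few_shot: bool = False


def relevant_examples(queries, included_tables):
    """Return few-shot queries whose SQL mentions any included table.

    Uses a naive case-insensitive substring match (an assumption).
    """
    wanted = [t.lower() for t in included_tables]
    return [
        q for q in queries
        if q.few_shot and any(t in q.sql.lower() for t in wanted)
    ]


queries = [
    SavedQuery("Orders per user", "SELECT user_id, COUNT(*) FROM orders GROUP BY user_id", few_shot=True),
    SavedQuery("All invoices", "SELECT * FROM invoices", few_shot=True),
    SavedQuery("Scratch", "SELECT * FROM orders", few_shot=False),
]

matches = relevant_examples(queries, ["Orders"])
print([q.title for q in matches])
```

Only queries that are both flagged as examples and touch an included table survive, which keeps unrelated reference SQL out of the prompt.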
diff --git a/explorer/assistant/utils.py b/explorer/assistant/utils.py
@@ -42,7 +42,7 @@ def table_schema(db_connection, table_name):
     schema = schema_info(db_connection)
     s = [table for table in schema if table[0] == table_name]
     if len(s):
-        return s[0]
+        return s[0][1]
 
 
 def sample_rows_from_table(connection, table_name):
@@ -72,8 +72,8 @@ def sample_rows_from_table(connection, table_name):
                 new_val = field
                 if isinstance(field, str) and len(field) > MAX_FIELD_SAMPLE_SIZE:
                     new_val = field[:MAX_FIELD_SAMPLE_SIZE] + "..."  # Truncate and add ellipsis
-                elif isinstance(field, (bytes, bytearray)) and len(field) > MAX_FIELD_SAMPLE_SIZE:
-                    new_val = field[:MAX_FIELD_SAMPLE_SIZE] + b"..."  # Truncate binary data
+                elif isinstance(field, (bytes, bytearray)):
+                    new_val = "<binary_data>"
                 processed_row.append(new_val)
             ret.append(processed_row)
 
@@ -96,8 +96,7 @@ def format_rows_from_table(rows):
 
 def build_system_prompt(flavor):
     bsp = ExplorerValue.objects.get_item(ExplorerValue.ASSISTANT_SYSTEM_PROMPT).value
-    bsp += f"""\n\nYou are an expert at writing SQL, specifically for {flavor}, and account for the nuances
-    of this dialect of SQL. You always respond with valid {flavor} SQL."""
+    bsp += f"\nYou are an expert at writing SQL, specifically for {flavor}, and account for the nuances of this dialect of SQL. You always respond with valid {flavor} SQL."  # noqa
     return bsp
 
 
@@ -125,17 +124,30 @@ def get_relevant_few_shots(db_connection, included_tables):
     ).filter(query_conditions)
 
 
+def get_few_shot_chunk(db_connection, included_tables):
+    included_tables = [t.lower() for t in included_tables]
+    few_shot_examples = get_relevant_few_shots(db_connection, included_tables)
+    if few_shot_examples:
+        return "## Relevant example queries, written by expert SQL analysts ##\n" + "\n\n".join(
+            [f"Description: {fs.title} - {fs.description}\nSQL:\n{fs.sql}"
+             for fs in few_shot_examples.all()]
+        )
+
+
 @dataclass
 class TablePromptData:
     name: str
-    schema: str
+    schema: list
     sample: list
     annotation: TableDescription
 
     def render(self):
+        fmt_schema = "\n".join([str(field) for field in self.schema])
         ret = f"""## Information for Table '{self.name}' ##
-        Schema:\n{self.schema}
-        Sample rows:\n{format_rows_from_table(self.sample)}"""
+
+Schema:\n{fmt_schema}
+
+Sample rows:\n{format_rows_from_table(self.sample)}"""
         if self.annotation:
             ret += f"\nUsage Notes:\n{self.annotation.description}"
         return ret
@@ -144,8 +156,8 @@ def render(self):
 def build_prompt(db_connection, assistant_request, included_tables, query_error=None, sql=None):
     included_tables = [t.lower() for t in included_tables]
 
-    error_chunk = f"## Query Error ##\n{query_error}" if query_error else ""
-    sql_chunk = f"## Existing User-Written SQL ##\n{sql}" if sql else ""
+    error_chunk = f"## Query Error ##\n{query_error}" if query_error else None
+    sql_chunk = f"## Existing User-Written SQL ##\n{sql}" if sql else None
     request_chunk = f"## User's Request to Assistant ##\n{assistant_request}"
     table_chunks = [
         TablePromptData(
@@ -156,19 +168,12 @@ def build_prompt(db_connection, assistant_request, included_tables, query_error=
         ).render()
         for t in included_tables
     ]
+    few_shot_chunk = get_few_shot_chunk(db_connection, included_tables)
 
-    few_shot_examples = get_relevant_few_shots(db_connection, included_tables)
-    if few_shot_examples:
-        few_shot_chunk = "## Relevant example queries, written by expert SQL analysts ##\n" + "\n\n".join(
-            [f"""Description: {fs.title} - {fs.description}
-            SQL:\n{fs.sql}"""
-             for fs in few_shot_examples.all()]
-        )
-    else:
-        few_shot_chunk = ""
+    chunks = [error_chunk, sql_chunk, *table_chunks, few_shot_chunk, request_chunk]
 
     prompt = {
         "system": build_system_prompt(db_connection.as_django_connection().vendor),
-        "user": "\n\n".join([error_chunk, sql_chunk, *table_chunks, few_shot_chunk, request_chunk]),
+        "user": "\n\n".join([c for c in chunks if c]),
     }
     return prompt
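The switch from `""` to `None` for optional chunks, combined with the filtered join in `build_prompt`, can be seen in isolation in the sketch below. `join_chunks` is a hypothetical helper name and the chunk contents are made up; only the join expression itself is taken from the diff.

```python
def join_chunks(chunks):
    """Join prompt sections with blank lines, skipping None or empty ones."""
    return "\n\n".join([c for c in chunks if c])


error_chunk = None  # no query error in this request
sql_chunk = "## Existing User-Written SQL ##\nSELECT 1"
few_shot_chunk = None  # no matching example queries
request_chunk = "## User's Request to Assistant ##\nAdd a WHERE clause"

chunks = [error_chunk, sql_chunk, few_shot_chunk, request_chunk]

filtered = join_chunks(chunks)
# Joining empty strings instead (the pre-change behaviour) leaves stray separators.
unfiltered = "\n\n".join([c or "" for c in chunks])

print(repr(filtered))
print(repr(unfiltered))
```

The unfiltered variant starts with a blank separator and has a run of four newlines where the missing few-shot chunk would have been, so filtering keeps the prompt compact.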
diff --git a/explorer/src/js/uploads.js b/explorer/src/js/uploads.js
@@ -52,7 +52,7 @@ export function setupUploads() {
         }
 
         let xhr = new XMLHttpRequest();
-        xhr.open('POST', `${window.baseUrlPath}upload/`, true);
+        xhr.open('POST', `${window.baseUrlPath}connections/upload/`, true);
         xhr.setRequestHeader('X-CSRFToken', getCsrfToken());
 
         xhr.upload.onprogress = function(event) {

diff --git a/explorer/tests/test_assistant.py b/explorer/tests/test_assistant.py
@@ -39,23 +39,19 @@ def setUp(self):
         }
 
     @patch("explorer.assistant.utils.openai_client")
-    @patch("explorer.assistant.utils.num_tokens_from_string")
-    def test_do_modify_query(self, mocked_num_tokens, mocked_openai_client):
+    def test_do_modify_query(self, mocked_openai_client):
         from explorer.assistant.views import run_assistant
 
         # create.return_value should match: resp.choices[0].message
         mocked_openai_client.return_value.chat.completions.create.return_value = Mock(
             choices=[Mock(message=Mock(content="smart computer"))])
-        mocked_num_tokens.return_value = 100
         resp = run_assistant(self.request_data, None)
         self.assertEqual(resp, "smart computer")
 
     @patch("explorer.assistant.utils.openai_client")
-    @patch("explorer.assistant.utils.num_tokens_from_string")
-    def test_assistant_help(self, mocked_num_tokens, mocked_openai_client):
+    def test_assistant_help(self, mocked_openai_client):
         mocked_openai_client.return_value.chat.completions.create.return_value = Mock(
             choices=[Mock(message=Mock(content="smart computer"))])
-        mocked_num_tokens.return_value = 100
         resp = self.client.post(reverse("assistant"),
                                 data=json.dumps(self.request_data),
                                 content_type="application/json")
@@ -73,8 +69,9 @@ def test_build_prompt_with_vendor_only(self, mock_get_item):
         self.assertIn("sqlite", result["system"])
 
     @patch("explorer.assistant.utils.sample_rows_from_table", return_value="sample data")
+    @patch("explorer.assistant.utils.table_schema", return_value=[])
     @patch("explorer.models.ExplorerValue.objects.get_item")
-    def test_build_prompt_with_sql_and_annotation(self, mock_get_item, mock_sample_rows):
+    def test_build_prompt_with_sql_and_annotation(self, mock_get_item, mock_table_schema, mock_sample_rows):
         mock_get_item.return_value.value = "system prompt"
 
         included_tables = ["foo"]
@@ -86,8 +83,9 @@ def test_build_prompt_with_sql_and_annotation(self, mock_get_item, mock_sample_r
         self.assertIn("Usage Notes:\nannotated", result["user"])
 
     @patch("explorer.assistant.utils.sample_rows_from_table", return_value="sample data")
+    @patch("explorer.assistant.utils.table_schema", return_value=[])
     @patch("explorer.models.ExplorerValue.objects.get_item")
-    def test_build_prompt_with_few_shot(self, mock_get_item, mock_sample_rows):
+    def test_build_prompt_with_few_shot(self, mock_get_item, mock_table_schema, mock_sample_rows):
         mock_get_item.return_value.value = "system prompt"
 
         included_tables = ["magic"]
@@ -154,7 +152,7 @@ def test_truncates_long_strings(self):
         self.assertEqual(row[0], "a" * 200 + "...")
         self.assertEqual(row[1], "short string")
 
-    def test_truncates_long_binary_data(self):
+    def test_binary_data(self):
         long_binary = b"a" * 600
 
         # Mock database connection and cursor
@@ -169,8 +167,8 @@ def test_truncates_long_binary_data(self):
         header, row = ret
 
         self.assertEqual(header, ["col1", "col2"])
-        self.assertEqual(row[0], b"a" * 200 + b"...")
-        self.assertEqual(row[1], b"short binary")
+        self.assertEqual(row[0], "<binary_data>")
+        self.assertEqual(row[1], "<binary_data>")
 
     def test_handles_various_data_types(self):
         # Mock database connection and cursor
@@ -212,7 +210,7 @@ def test_format_rows_from_table(self):
     def test_schema_info_from_table_names(self):
         from explorer.assistant.utils import table_schema
         ret = table_schema(default_db_connection(), "explorer_query")
-        expected = ("explorer_query", [
+        expected = [
             ("id", "AutoField"),
             ("title", "CharField"),
             ("sql", "TextField"),
@@ -223,7 +221,7 @@ def test_schema_info_from_table_names(self):
             ("snapshot", "BooleanField"),
             ("connection", "CharField"),
             ("database_connection_id", "IntegerField"),
-            ("few_shot", "BooleanField")])
+            ("few_shot", "BooleanField")]
         self.assertEqual(ret, expected)
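The sample-row behaviour that `test_truncates_long_strings` and `test_binary_data` exercise can be sketched without a database. `process_field` is a hypothetical helper that mirrors the branch logic in `sample_rows_from_table`, and the value 200 for `MAX_FIELD_SAMPLE_SIZE` is an assumption inferred from the 200-character expectation in the tests.

```python
MAX_FIELD_SAMPLE_SIZE = 200  # assumed value, inferred from the test expectations


def process_field(field):
    """Shorten long strings and replace any binary value with a placeholder."""
    if isinstance(field, str) and len(field) > MAX_FIELD_SAMPLE_SIZE:
        return field[:MAX_FIELD_SAMPLE_SIZE] + "..."
    if isinstance(field, (bytes, bytearray)):
        return "<binary_data>"
    return field


row = ["a" * 600, "short string", b"a" * 600, b"short binary", 42]
processed = [process_field(f) for f in row]
print(processed[1:])
```

Binary values are replaced regardless of length, which is why both assertions in `test_binary_data` now expect the same placeholder.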