In this guide, we show how to automatically discover labels or categories for a set of reviews, then classify each review at scale. This saves time and quickly surfaces hidden insights in your feedback data.
Create or open a Jupyter Notebook. In this first step, we import the required libraries, set any needed environment variables (e.g., your API key), and optionally create a folder for JSON artifacts (e.g., tasks and results).
import os
import json
from openai import OpenAI
from flashlearn.skills.discover_labels import DiscoverLabelsSkill
from flashlearn.skills.classification import ClassificationSkill
# Set your OpenAI API key (skip if it is already set in your environment)
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
# (Optional) Create a folder for JSON artifacts
json_folder = "json_artifacts"
os.makedirs(json_folder, exist_ok=True)
print("Step 0 complete: Imports and environment setup.")
We define a small list of dictionaries, each with a "comment" field containing one text review. This is the data we will use to demonstrate label discovery and classification.
# Example text reviews
text_reviews = [
{"comment": "Battery life exceeded expectations, though camera was mediocre."},
{"comment": "Arrived late, but customer service was supportive."},
{"comment": "The product is overpriced. I want a refund."},
{"comment": "Fantastic design and smooth performance."},
{"comment": "Shipping was incredibly fast, but packaging could be improved."}
]
print("Step 1 complete: Data prepared.")
print("Sample data:", text_reviews)
Label discovery is useful for automatically inferring potential categories or topics from the entire dataset. Here, we use DiscoverLabelsSkill to find relevant labels in a single pass, supplying a model name and an OpenAI client.
def get_discover_skill():
    # Create a client (OpenAI in this example)
    client = OpenAI()
    discover_skill = DiscoverLabelsSkill(
        model_name="gpt-4o-mini",
        client=client
    )
    return discover_skill
discover_skill = get_discover_skill()
print("Step 2 complete: Label discovery skill initialized.")
We transform our list of text reviews into the "tasks" format expected by DiscoverLabelsSkill. Since this is text-only data, the column modalities can either be omitted or can simply mark "comment" as text; we do the latter here (a sketch of the omitted form follows the code below).
column_modalities = {
"comment": "text"
}
# Create discovery tasks
tasks_discover = discover_skill.create_tasks(
text_reviews,
column_modalities=column_modalities
)
print("Step 3 complete: Discovery tasks created.")
print("Sample discovery task:", tasks_discover[0])
Although optional, saving tasks to JSONL allows for offline processing or auditing.
This helps maintain reproducibility and a clear record of how tasks were generated.
tasks_discover_jsonl_path = os.path.join(json_folder, "discovery_tasks.jsonl")
with open(tasks_discover_jsonl_path, 'w') as f:
    for task in tasks_discover:
        f.write(json.dumps(task) + '\n')
print(f"Step 4 complete: Discovery tasks saved to {tasks_discover_jsonl_path}")
You can reload the tasks from JSONL at any time, which is helpful if you run discovery in a separate environment. Here we retrieve the tasks from the file we just saved.
loaded_discovery_tasks = []
with open(tasks_discover_jsonl_path, 'r') as f:
    for line in f:
        loaded_discovery_tasks.append(json.loads(line))
print("Step 5 complete: Discovery tasks reloaded from JSONL.")
print("A sample reloaded discovery task:", loaded_discovery_tasks[0])
We now run the tasks through the discovery skill, which analyzes the whole dataset and returns a suggested set of labels or topics. Typically, the output is contained in a single record (often keyed by "0") with a "labels" field.
discovery_output = discover_skill.run_tasks_in_parallel(loaded_discovery_tasks)
discovered_labels = discovery_output.get("0", {}).get("labels", [])
print("Step 6 complete: Labels discovered.")
print("Discovered labels:", discovered_labels)
Once we have a set of labels, we can assign them to individual records using ClassificationSkill. We initialize the skill with the discovered labels, which it will then apply to new or existing text data.
def get_classification_skill(labels):
    # Create a client and a skill constrained to the provided labels
    client = OpenAI()
    classify_skill = ClassificationSkill(
        model_name="gpt-4o-mini",
        client=client,
        categories=labels
    )
    return classify_skill
classify_skill = get_classification_skill(discovered_labels)
print("Step 7 complete: Classification skill initialized with discovered labels.")
We now form classification tasks for each text review, telling the skill which column to treat as textual content.
tasks_classify = classify_skill.create_tasks(
text_reviews,
column_modalities={"comment": "text"}
)
print("Step 8 complete: Classification tasks created.")
print("Sample classification task:", tasks_classify[0])
Next, we run the classification tasks in parallel and retrieve a dictionary keyed by task ID, where each value holds the assigned category or categories.
classification_results = classify_skill.run_tasks_in_parallel(tasks_classify)
print("Step 9 complete: Classification finished.")
print("Classification results (first few):")
for task_id, cats in list(classification_results.items())[:3]:
    print(f" Task ID {task_id}: {cats}")
Finally, we map each classification result back to its original record and (optionally) write the annotated data to a JSONL file for safekeeping and further analysis.
# Map the classification results back to the original data
annotated_data = []
for task_id_str, output_json in classification_results.items():
    task_id = int(task_id_str)
    record = text_reviews[task_id]
    record["classification"] = output_json  # attach the classification result
    annotated_data.append(record)
# Print a sample annotated record
print("Sample annotated record:", annotated_data[0])
# (Optional) Save annotated data to JSONL
final_results_path = os.path.join(json_folder, "classification_results.jsonl")
with open(final_results_path, 'w') as f:
    for rec in annotated_data:
        f.write(json.dumps(rec) + '\n')
print(f"Step 10 complete: Annotated results saved to {final_results_path}")
Using these steps, you have:
- Text data prepared in a list of dictionaries.
- A label discovery skill that aggregates topics across your dataset.
- A classification skill that assigns those discovered labels to each record.
- Tasks that can be created, saved (JSONL), and reloaded for transparent, reproducible workflows.
- Final annotated data, also in JSONL, for deeper analytics or integration with other systems.
From here, you can refine or expand:
- Use more complex instructions for label discovery.
- Develop additional skills (e.g., sentiment analysis, summarization); a sentiment sketch follows this list.
- Integrate discovered/classified labels into your data pipeline or dashboards.
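For example, the same ClassificationSkill pattern can act as a simple sentiment analyzer by passing fixed categories instead of discovered ones; a minimal sketch reusing the helper defined in Step 7:
# Reuse the classification pattern with fixed sentiment categories
sentiment_skill = get_classification_skill(["positive", "negative", "neutral"])
sentiment_tasks = sentiment_skill.create_tasks(text_reviews, column_modalities={"comment": "text"})
sentiment_results = sentiment_skill.run_tasks_in_parallel(sentiment_tasks)
print("Sentiment results:", sentiment_results)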
This completes our text-only guide for discovering and classifying labels using flashlearn!