2 changes: 0 additions & 2 deletions data_juicer/tools/DJ_mcp_recipe_flow.py
@@ -28,8 +28,6 @@ def get_data_processing_ops(
Operators are a collection of basic processes that assist in data modification,
cleaning, filtering, deduplication, etc.

Should be used with `run_data_recipe`.

If both tags and ops_type are None, return a list of all operators.

The following `op_type` values are supported:
62 changes: 62 additions & 0 deletions data_juicer/tools/mcp_server.py
@@ -0,0 +1,62 @@
#!/usr/bin/env python3

import argparse
import os
import sys


def main():
    """Data-Juicer MCP Server CLI entry point."""
    parser = argparse.ArgumentParser(
        description="Data-Juicer MCP Server",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Available modes:
  granular-ops    Launch MCP server with individual operator tools
  recipe-flow     Launch MCP server with recipe-based workflow tools

Examples:
  dj-mcp granular-ops --transport stdio
  dj-mcp recipe-flow --transport sse --port 8000
        """,
    )

    parser.add_argument("mode", choices=["granular-ops", "recipe-flow"], help="MCP server mode to launch")

    parser.add_argument(
        "--transport",
        choices=["stdio", "sse", "streamable-http"],
        default="stdio",
        help="Transport protocol for MCP server (default: stdio)",
    )

    parser.add_argument("--port", type=int, default=8080, help="Port number for HTTP-based transports (default: 8080)")

    args = parser.parse_args()

    # Set environment variable for transport
    os.environ["SERVER_TRANSPORT"] = args.transport

    try:
        if args.mode == "granular-ops":
            from data_juicer.tools.DJ_mcp_granular_ops import create_mcp_server

        elif args.mode == "recipe-flow":
            from data_juicer.tools.DJ_mcp_recipe_flow import create_mcp_server

        print(f"Starting Data-Juicer MCP Server ({args.mode} mode)")
        print(f"Transport: {args.transport}, Port: {args.port}")

        mcp = create_mcp_server(port=str(args.port))
        mcp.run(transport=args.transport)

    except ImportError as e:
        print(f"Error: Missing dependencies for MCP server. {e}")
        sys.exit(1)
    except Exception as e:
        print(f"Error starting MCP server: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()
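
To sanity-check the CLI surface this file adds, the sketch below rebuilds the same argument definitions in isolation and parses the two example invocations from the epilog. `build_parser` is a hypothetical helper introduced only for illustration; the PR itself constructs the parser inline inside `main()`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper mirroring the arguments defined in main()."""
    parser = argparse.ArgumentParser(description="Data-Juicer MCP Server")
    parser.add_argument("mode", choices=["granular-ops", "recipe-flow"])
    parser.add_argument(
        "--transport",
        choices=["stdio", "sse", "streamable-http"],
        default="stdio",
    )
    parser.add_argument("--port", type=int, default=8080)
    return parser


parser = build_parser()

# Equivalent to: dj-mcp granular-ops --transport stdio
stdio_args = parser.parse_args(["granular-ops", "--transport", "stdio"])

# Equivalent to: dj-mcp recipe-flow --transport sse --port 8000
sse_args = parser.parse_args(["recipe-flow", "--transport", "sse", "--port", "8000"])

print(stdio_args.mode, stdio_args.transport, stdio_args.port)
print(sse_args.mode, sse_args.transport, sse_args.port)
```

Note that `--port` keeps its default of 8080 even under the stdio transport, where it is parsed but only meaningful for the HTTP-based transports.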
226 changes: 119 additions & 107 deletions docs/DJ_service.md
@@ -112,130 +112,142 @@ The server supports two deployment methods: **stdio** and **SSE**. The **stdio**
- `export_path` (str, optional): The path to export the dataset to. Default is None, meaning the dataset will be exported to './outputs'
- Returns: A string representing the execution result

For specific data processing requests, the MCP client should first call `get_data_processing_ops` to obtain relevant operator information, select operators that match the requirements from it, and then call `run_data_recipe` to run the selected combination of operators.
For specific data processing requests, the MCP client should first call `get_data_processing_ops` to obtain relevant operator information, select the operators that meet the requirements, and then call `run_data_recipe` to execute the chosen operator combination.
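
The two-step flow described here can be sketched roughly as follows. This is an illustration under several assumptions, not the project's client code: `session` is any object exposing an async `call_tool(name, arguments)` method, `FakeSession` is a stand-in so the sketch runs without a server, the `"filter"` value for `op_type`, the `dataset_path` and `process` argument names, and the `text_length_filter` operator with its `min_len` parameter are all assumed for the example. A real MCP client would also need to unpack the tool result before reading operator names from it.

```python
import asyncio
from typing import Any


async def run_recipe_flow(session: Any, dataset_path: str, wanted: set) -> Any:
    """Discover operators first, then run the selected combination."""
    # Step 1: ask the server which operators exist.
    # Assumption: "filter" is one of the supported `op_type` values.
    ops = await session.call_tool("get_data_processing_ops", {"op_type": "filter"})

    # Step 2: keep only the operators that match the request.
    # Assumption: the result behaves like a mapping keyed by operator name.
    selected = [name for name in ops if name in wanted]

    # Step 3: run the chosen operators as one recipe.
    # Assumption: argument names and operator parameters are illustrative.
    process = [{name: {"min_len": 10}} for name in selected]
    return await session.call_tool(
        "run_data_recipe",
        {"dataset_path": dataset_path, "process": process, "export_path": None},
    )


class FakeSession:
    """Stand-in for a real MCP client session, used only in this sketch."""

    def __init__(self) -> None:
        self.calls = []

    async def call_tool(self, name: str, arguments: dict) -> Any:
        self.calls.append((name, arguments))
        if name == "get_data_processing_ops":
            return {"text_length_filter": "placeholder description"}
        return "ok"


session = FakeSession()
result = asyncio.run(
    run_recipe_flow(session, "/path/to/dataset.jsonl", {"text_length_filter"})
)
print([name for name, _ in session.calls], result)
```

The ordering is the point of the sketch: the recipe passed to `run_data_recipe` is built only from names the server reported in the first call.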

#### Configuration

The following configuration examples demonstrate how to set up the Recipe-Flow server using both stdio and SSE transport methods. These examples are for illustrative purposes and should be adapted to your specific MCP client's configuration format.
### Granular-Operators

##### stdio Transport
By default, this MCP server returns all Data-Juicer operator tools, each running independently.

Add the following to your MCP client's configuration file (e.g., `claude_desktop_config.json` or a similar configuration file):
To control the operator tools returned by the MCP server, specify the environment variable `DJ_OPS_LIST_PATH`:
1. Create a `.txt` file.
2. Add operator names to the file, e.g., [ops_list_example.txt](../configs/mcp/ops_list_example.txt).
3. Set the path to the operator list as the environment variable `DJ_OPS_LIST_PATH`.
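
The three steps can also be scripted. The sketch below assumes the list file holds one operator name per line (check `ops_list_example.txt` for the authoritative format) and uses two operator names purely as examples.

```python
import os
import tempfile
from pathlib import Path

# Assumption: one operator name per line; these names are examples only.
op_names = ["text_length_filter", "language_id_score_filter"]

# Steps 1 and 2: create the .txt file and add the operator names.
ops_list_path = Path(tempfile.mkdtemp()) / "ops_list.txt"
ops_list_path.write_text("\n".join(op_names) + "\n", encoding="utf-8")

# Step 3: expose the path through DJ_OPS_LIST_PATH.
os.environ["DJ_OPS_LIST_PATH"] = str(ops_list_path)

print(os.environ["DJ_OPS_LIST_PATH"])
```

Setting `os.environ` only affects the current process and any children it spawns. When an MCP client launches the server itself, the variable belongs in the `env` block of the client configuration instead.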

```json
"mcpServers": {
"DJ_recipe_flow": {
"transport": "stdio",
"command": "/path/to/python",
"args": [
"/path/to/data_juicer/tools/DJ_mcp_recipe_flow.py"
],
"env": {
"SERVER_TRANSPORT": "stdio"
}
}
}
```
### Configuration

##### SSE Transport
The following configuration examples demonstrate how to set up the two MCP server types using the stdio and SSE methods. These examples are for illustrative purposes only and should be adapted to the specific MCP client's configuration format.

To use the SSE transport, you first need to start the MCP server separately.
#### stdio

1. Run the Server: Execute the server script, specifying the port number:
Suitable for quick local testing and simple scenarios. Add the following to the MCP client's configuration file (e.g., `claude_desktop_config.json` or similar):

```bash
python /path/to/data_juicer/tools/DJ_mcp_recipe_flow.py --port=8080
```
##### Using uvx

2. Configure your MCP Client: Add the following to your MCP client's configuration file:
Run the latest version of Data-Juicer MCP directly from the repository without manual local installation.

```json
- Recipe-Flow mode:
```json
{
"mcpServers": {
"DJ_recipe_flow": {
"url": "http://127.0.0.1:8080/sse"
"command": "uvx",
"args": [
"--from",
"git+https://github.com/modelscope/data-juicer",
"dj-mcp",
"recipe-flow"
]
}
}
```

Note:

* URL: The `url` should point to the SSE endpoint of your running server (typically `http://127.0.0.1:<port>/sse`). Adjust the port number if you used a different value when starting the server.
* Separate Server Process: The SSE server must be running before your MCP client attempts to connect.
* Firewall: Ensure that your firewall allows connections to the specified port.

### Granular-Operators

By default, this MCP server will return all Data-Juicer operator tools, each running independently.

You can control the operator tools returned by the MCP server by specifying the environment variable `DJ_OPS_LIST_PATH`:

1. Create a `.txt` file.
2. Add operator names to the file, such as: [ops_list_example.txt](../configs/mcp/ops_list_example.txt).
3. Set the path to the operators list as the environment variable `DJ_OPS_LIST_PATH`.

#### Configuration

The following configuration examples demonstrate how to set up the Granular-Operators server using both stdio and SSE transport methods. These examples are for illustrative purposes and should be adapted to your specific MCP client's configuration format.

##### stdio Transport

Add the following to your MCP client's configuration file:
}
```

```json
"mcpServers": {
"DJ_granular_ops_stdio": {
"transport": "stdio",
"command": "/path/to/python",
"args": [
"/path/to/data_juicer/tools/DJ_mcp_granular_ops.py"
],
"env": {
"DJ_OPS_LIST_PATH": "/path/to/ops_list.txt",
"SERVER_TRANSPORT": "stdio"
- Granular-Operators mode:
```json
{
"mcpServers": {
"DJ_granular_ops": {
"command": "uvx",
"args": [
"--from",
"git+https://github.com/modelscope/data-juicer",
"dj-mcp",
"granular-ops",
"--transport",
"stdio"
],
"env": {
"DJ_OPS_LIST_PATH": "/path/to/ops_list.txt"
}
}
}
}
}
```

##### SSE Transport

To use the SSE transport, you first need to start the MCP server separately.

1. Set Environment Variables: Ensure any required environment variables for the server are set, including `DJ_OPS_LIST_PATH` if you're using it.
2. Run the Server: Execute the server script, specifying the port number:

```bash
python /path/to/data_juicer/tools/DJ_mcp_granular_ops.py --port=8081
```

3. Configure your MCP Client: Add the following to your MCP client's configuration file:

```json
```
Note: If `DJ_OPS_LIST_PATH` is not set, all operators are returned by default.
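
For clients whose configuration is generated rather than hand-edited, the same entry can be built programmatically. The sketch below is one way to do it; `granular_ops_config` is a hypothetical helper, not part of Data-Juicer, and it simply omits the `env` block when no operator list is given so that the server falls back to returning all operators.

```python
import json
from typing import Optional


def granular_ops_config(ops_list_path: Optional[str] = None) -> dict:
    """Hypothetical helper: build the uvx-based Granular-Operators entry."""
    server = {
        "command": "uvx",
        "args": [
            "--from",
            "git+https://github.com/modelscope/data-juicer",
            "dj-mcp",
            "granular-ops",
            "--transport",
            "stdio",
        ],
    }
    # Only restrict the operator set when a list file is supplied.
    if ops_list_path is not None:
        server["env"] = {"DJ_OPS_LIST_PATH": ops_list_path}
    return {"mcpServers": {"DJ_granular_ops": server}}


restricted = granular_ops_config("/path/to/ops_list.txt")
unrestricted = granular_ops_config()

print(json.dumps(restricted, indent=2))
```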

##### Local Installation

1. Clone the Data-Juicer repository locally:
```bash
git clone https://github.com/modelscope/data-juicer.git
```
2. Run Data-Juicer MCP using uv:
- Recipe-Flow mode:
```json
{
"mcpServers": {
"DJ_granular_ops_sse": {
"url": "http://127.0.0.1:8081/sse"
"DJ_recipe_flow": {
"transport": "stdio",
"command": "uv",
"args": [
"run",
"--directory",
"/abs/path/to/data-juicer",
"dj-mcp",
"recipe-flow"
]
}
}
```

Note:

* URL: The `url` should point to the SSE endpoint of your running server (typically `http://127.0.0.1:<port>/sse`). Adjust the port number if you used a different value when starting the server.
* Separate Server Process: The SSE server must be running before your MCP client attempts to connect.
* Firewall: Ensure that your firewall allows connections to the specified port.

### Finding Your Python Path

To find the path to the Python executable, use the following commands:

Windows (Command Prompt/Terminal):

```sh
where python
```

Linux/macOS (Terminal):

```sh
which python
```
}
```
- Granular-Operators mode:
```json
{
"mcpServers": {
"DJ_granular_ops": {
"transport": "stdio",
"command": "uv",
"args": [
"run",
"--directory",
"/abs/path/to/data-juicer",
"dj-mcp",
"granular-ops"
],
"env": {
"DJ_OPS_LIST_PATH": "/path/to/ops_list.txt"
}
}
}
}
```

#### SSE

To use SSE deployment, first start the MCP server separately.

1. Run the MCP server: Execute the MCP server script and specify the port number:
- Using uvx:
```bash
uvx --from git+https://github.com/modelscope/data-juicer dj-mcp <MODE: recipe-flow/granular-ops> --transport sse --port 8080
```
- Local execution:
```bash
uv run dj-mcp <MODE: recipe-flow/granular-ops> --transport sse --port 8080
```

2. Configure your MCP client: Add the following to the MCP client's configuration file:
```json
{
"mcpServers": {
"DJ_MCP": {
"url": "http://127.0.0.1:8080/sse"
}
}
}
```

Notes:
- URL: The `url` should point to the SSE endpoint of the running server (typically `http://127.0.0.1:<port>/sse`). Adjust the port number if a different value was used when starting the server.
- Separate server process: The SSE server must be running before the MCP client attempts to connect.
- Firewall: Ensure the firewall allows connections to the specified port.
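
Before pointing a client at the server, it can help to confirm that the URL matches the port and that something is listening on it. The following is a minimal sketch using only the standard library; `sse_url` and `is_listening` are hypothetical helpers, and a successful TCP connection only shows that the port is open, not that the process behind it is the Data-Juicer MCP server.

```python
import socket


def sse_url(port: int, host: str = "127.0.0.1") -> str:
    """Build the SSE endpoint URL in the form used by the client config."""
    return f"http://{host}:{port}/sse"


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or blocked (for example by a firewall).
        return False


print(sse_url(8080))
print(is_listening(8080))
```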