Excuse me. How can I evaluate the initial effect? #10

enjoylife8888 · 2025-01-15T02:39:50Z

Hello author, I am very interested in your paper. While reading the paper and code, I found that the initial results of the paper are quite good. May I ask how you evaluated them? Is there any related code available?

Moreover, I have a question: Is the preliminary SQL file already the initial effect?

# Construct `ppl_dev.json`. 
python src/data_construct.py 

# Construct few-shot example pairs
python few_shot/construct_QA.py 

# Generate few-shot examples
python few_shot/slg_main.py --dataset src/information/ppl_dev.json --out_file src/information/example.json --kshot 3

# add few-shot examples to ppl_dev.json
python src/information/add_example.py

# step 1: preliminary sql
# There are two output files in this step, one is `src/sql_log/preliminary_sql.txt` and the other is `src/schema_linking/LLM.json`
python src/step_1_preliminary_sql.py --ppl_file src/information/ppl_dev.json --sql_out_file src/sql_log/preliminary_sql.txt --Schema_linking_LLM src/schema_linking/LLM.json --start_index 0

As shown in README.md, executing python src/step_1_preliminary_sql.py will yield the initial results presented in the paper, correct? Therefore, by evaluating preliminary_sql.txt, the initial results can be obtained. Is this correct?

BTW. Which one is used for the initial effect: filter schema, full schema, or schema with augmentation?

Really looking forward to your reply. Thank you.

enjoylife8888 · 2025-01-15T04:00:01Z

I attempted to evaluate preliminary_sql.txt separately, and the execution results are as follows:

It seems that the score obtained using the full schema is even higher than the initial results, and bidirectional linking has not yet been performed.

Laqcce-cao · 2025-01-17T09:42:23Z

The three results are not presented. The folder names in Main_result correspond one-to-one to the results in the picture.

enjoylife8888 · 2025-01-17T15:43:04Z

@Laqcce-cao So you mean the results reproduced in the image are obtained after step 1: Bidirectional Schema Linking, right?

BTW. Could you please provide the code for these three results before step1?

Another question about RSL: when I was reproducing with GPT-4, I noticed that the scores from step 1 to step 4 first decreased and then increased. Does this mean that the effect and role of step 2 are not significant?

Overall Performance

Step1 Effect

RSL-SQL> python .\evaluation\evaluation_changed.py --predicted_sql_path ./src/sql_log/preliminary_sql.txt --ground_truth_path ./data/dev.sql --data_mode dev --db_root_path ./database/dev_databases/ --diff_json_path ./data/dev.json --num_cpus 8
start calculate
                     simple               moderate             challenging          total
count                925                  464                  145                  1534
======================================    ACCURACY    =====================================
accuracy             69.95                52.59                50.34                **62.84**
===========================================================================================
Finished evaluation

Step2 Effect

RSL-SQL> python .\evaluation\evaluation_changed.py --predicted_sql_path .\src\sql_log\step_2_information_augmentation.txt --ground_truth_path .\data\dev.sql --data_mode dev --db_root_path .\database\dev_databases\ --num_cpus 8 --diff_json_path .\data\dev.json
start calculate
                     simple               moderate             challenging          total
count                925                  464                  145                  1534
======================================    ACCURACY    =====================================
accuracy             69.19                52.37                54.48                **62.71**
===========================================================================================
Finished evaluation

Step3 Effect

RSL-SQL> python .\evaluation\evaluation_changed.py --predicted_sql_path .\src\sql_log\step_3_binary.txt --ground_truth_path .\data\dev.sql --data_mode dev --db_root_path .\database\dev_databases\ --num_cpus 8 --diff_json_path .\data\dev.json
start calculate
                     simple               moderate             challenging          total
count                925                  464                  145                  1534
======================================    ACCURACY    =====================================
accuracy             72.32                56.03                57.93                66.04
===========================================================================================
Finished evaluation

Step4 Final Effect

RSL-SQL> python .\evaluation\evaluation_changed.py --predicted_sql_path .\src\sql_log\final_sql.txt --ground_truth_path ./data/dev.sql --data_mode dev --db_root_path ./database/dev_databases/ --diff_json_path ./data/dev.json --num_cpus 8    
start calculate
                     simple               moderate             challenging          total
count                925                  464                  145                  1534
======================================    ACCURACY    =====================================
accuracy             72.76                57.11                58.62                66.69
===========================================================================================
Finished evaluation
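
The four evaluation runs above differ only in the predicted-SQL path, so they could be scripted. The following is a minimal sketch, not part of the repository: it uses only the script name, flags, and file paths that appear in the runs above, assumes it is executed from the repository root, and the helper names `build_eval_command` and `build_all_commands` are hypothetical.

```python
import shlex
import sys

# Predicted-SQL files named in the runs above, in pipeline order.
STEP_OUTPUTS = {
    "step1": "src/sql_log/preliminary_sql.txt",
    "step2": "src/sql_log/step_2_information_augmentation.txt",
    "step3": "src/sql_log/step_3_binary.txt",
    "step4": "src/sql_log/final_sql.txt",
}


def build_eval_command(predicted_sql_path, num_cpus=8):
    """Return the argv list for one evaluation run (flags copied from the runs above)."""
    return [
        sys.executable, "evaluation/evaluation_changed.py",
        "--predicted_sql_path", predicted_sql_path,
        "--ground_truth_path", "data/dev.sql",
        "--data_mode", "dev",
        "--db_root_path", "database/dev_databases/",
        "--diff_json_path", "data/dev.json",
        "--num_cpus", str(num_cpus),
    ]


def build_all_commands():
    """Map each step name to its evaluation command."""
    return {step: build_eval_command(path) for step, path in STEP_OUTPUTS.items()}


if __name__ == "__main__":
    # Only print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
    for step, cmd in build_all_commands().items():
        print(f"{step}: {shlex.join(cmd)}")
```

Printing the commands first makes it easy to confirm that every step is evaluated with identical settings before spending time on the runs.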

As the results show, the overall accuracy of step 2 is slightly lower than that of step 1 (62.71 vs. 62.84), contrary to the improvement reported in the paper's ablation experiments. Moreover, the paper reports an overall score of 67.21, while my result using GPT was only 66.69, a gap of about 0.5 points.
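
To put these differences in terms of individual questions, the per-difficulty accuracies can be converted back into counts of correct predictions. The sketch below is an illustration only: the numbers are copied from the evaluation output above, and it assumes each printed accuracy is the number of correct predictions divided by the `count` row, rounded to two decimals.

```python
# Question counts per difficulty, from the "count" row of the evaluation output.
COUNTS = {"simple": 925, "moderate": 464, "challenging": 145}

# Per-difficulty accuracies (%) reported for each step above.
REPORTED = {
    "step1": {"simple": 69.95, "moderate": 52.59, "challenging": 50.34},
    "step2": {"simple": 69.19, "moderate": 52.37, "challenging": 54.48},
    "step3": {"simple": 72.32, "moderate": 56.03, "challenging": 57.93},
    "step4": {"simple": 72.76, "moderate": 57.11, "challenging": 58.62},
}


def correct_counts(accuracies):
    """Recover the number of correct predictions per difficulty level."""
    return {level: round(accuracies[level] * COUNTS[level] / 100) for level in COUNTS}


total_questions = sum(COUNTS.values())
totals = {step: sum(correct_counts(acc).values()) for step, acc in REPORTED.items()}

for step, n_correct in totals.items():
    print(f"{step}: {n_correct}/{total_questions} = {100 * n_correct / total_questions:.2f}%")
# step1: 964/1534 = 62.84%
# step2: 962/1534 = 62.71%
# step3: 1013/1534 = 66.04%
# step4: 1023/1534 = 66.69%
```

Under that assumption, the recovered totals reproduce the printed totals exactly, and the gap between step 1 and step 2 amounts to 2 questions out of 1534.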

I ran three separate experiments, and the resulting scores were almost identical. I therefore have doubts about the effectiveness of step 2 and about whether the paper's results can be reproduced.

Laqcce-cao · 2025-01-21T06:08:39Z

log.zip

My experimental data is here for you to test.

I'd like to know whether you made any changes to my code when reproducing it, for example to the structure of the prompts. If possible, you could also use DeepSeek for the reproduction, since it is relatively cheap.

Laqcce-cao · 2025-01-21T10:03:50Z

It's worth noting that, depending on the LLM, Step 2 may not always outperform Step 1. This is exactly the potential risk of schema linking that we refer to.

However, Step 3 relies on both Step 1 and Step 2. The SQL generated in Step 2 greatly helps improve the performance of Step 3.

The SQL generated in Steps 1 and 2 each has its own advantages, and a significant portion of their correct and incorrect cases do not overlap. Step 3 hedges the risks between the two.
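
The non-overlap argument can be checked on any pair of result files by comparing which questions each step answers correctly. The following is a minimal sketch with made-up toy data, not the actual experimental results: the function name `overlap_report` and the question indices are hypothetical.

```python
def overlap_report(correct_a, correct_b):
    """Compare two sets of correctly answered question indices."""
    a, b = set(correct_a), set(correct_b)
    return {
        "both": len(a & b),      # answered correctly by both steps
        "only_a": len(a - b),    # correct only in the first step
        "only_b": len(b - a),    # correct only in the second step
        "either": len(a | b),    # upper bound for a perfect choice between the two
    }


# Toy data (hypothetical): indices of questions answered correctly by each step.
step1_correct = {0, 1, 2, 3, 5, 8}
step2_correct = {0, 1, 2, 4, 6, 8}

report = overlap_report(step1_correct, step2_correct)
print(report)
# {'both': 4, 'only_a': 2, 'only_b': 2, 'either': 8}
```

In this toy case both steps score 6 correct, yet a step that always picked the right candidate of the two could reach 8. The larger `only_a` and `only_b` are, the more room there is to gain from choosing between the two candidates, even when the two steps have similar accuracy on their own.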

Step 2 is not yet perfect, and we are currently exploring ways to further enhance its robustness.

Our other experiments (may not be included in the paper) show that within the RSL-SQL framework, performance improvements in either Step 1 or Step 2 will contribute to boosting the final performance.
