Excuse me. How can I evaluate the initial effect? #10

Open
enjoylife8888 opened this issue Jan 15, 2025 · 5 comments


@enjoylife8888

Hello author, I am very interested in your paper. While reading the paper and code, I found that the initial results of the paper are quite good. May I ask how you evaluated them? Is there any related code available?

image

Moreover, I have a question: does the preliminary SQL file already correspond to the initial results?

# Construct `ppl_dev.json`. 
python src/data_construct.py 

# Construct few-shot example pairs
python few_shot/construct_QA.py 

# Generate few-shot examples
python few_shot/slg_main.py --dataset src/information/ppl_dev.json --out_file src/information/example.json --kshot 3

# add few-shot examples to ppl_dev.json
python src/information/add_example.py

# step 1: preliminary sql
# There are two output files in this step, one is `src/sql_log/preliminary_sql.txt` and the other is `src/schema_linking/LLM.json`
python src/step_1_preliminary_sql.py --ppl_file src/information/ppl_dev.json --sql_out_file src/sql_log/preliminary_sql.txt --Schema_linking_LLM src/schema_linking/LLM.json --start_index 0

As shown in README.md, executing python src/step_1_preliminary_sql.py will yield the initial results presented in the paper, correct? Therefore, by evaluating preliminary_sql.txt, the initial results can be obtained. Is this correct?
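For readers following along, here is a minimal sketch of what "evaluating preliminary_sql.txt" amounts to, assuming the metric is execution accuracy (a prediction counts as correct when it returns the same set of rows as the ground-truth query). This is not the repository's evaluation script; the toy table, the queries, and the helper names `execution_match` and `execution_accuracy` are made up for illustration.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as sets, so row order and duplicate rows are ignored.
    return set(predicted_rows) == set(gold_rows)


def execution_accuracy(conn, pairs):
    """Percentage of (predicted, gold) pairs whose results match."""
    correct = sum(execution_match(conn, p, g) for p, g in pairs)
    return 100.0 * correct / len(pairs)


# Hypothetical toy database standing in for one of the dev databases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, city TEXT, enrollment INTEGER)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [("A", "Fresno", 500), ("B", "Fresno", 300), ("C", "Oakland", 700)],
)

pairs = [
    # Same rows, different order: counted as correct.
    ("SELECT name FROM schools WHERE city = 'Fresno'",
     "SELECT name FROM schools WHERE city = 'Fresno' ORDER BY name"),
    # Different rows: counted as wrong.
    ("SELECT name FROM schools WHERE enrollment > 600",
     "SELECT name FROM schools WHERE enrollment > 400"),
    # Misspelled column, so the prediction fails to run: counted as wrong.
    ("SELECT nme FROM schools", "SELECT name FROM schools"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2f}")
```

Under this assumption, scoring any intermediate file (preliminary, step 2, step 3, or final) is the same procedure applied to a different list of predicted queries.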

By the way, which schema is used for the initial results: the filtered schema, the full schema, or the schema with augmentation?

Really looking forward to your reply. Thank you.

@enjoylife8888
Author

I attempted to evaluate preliminary_sql.txt separately, and the execution results are as follows:

[Image: deepseek]

[Image: gpt-4o]

It seems that the score obtained using the full schema is even higher than the initial results, and bidirectional linking has not yet been performed.

@Laqcce-cao
Owner

[Image] The three results are not presented. The folder names in Main_result correspond one-to-one to the results in the picture.

@enjoylife8888
Author

enjoylife8888 commented Jan 17, 2025

@Laqcce-cao So you mean the results reproduced in the image are obtained after step 1: Bidirectional Schema Linking, right?

BTW. Could you please provide the code for these three results before step1?

Another question about RSL: When I was reproducing with GPT-4, I noticed that the scores from step 1 to step 4 showed a trend of first decreasing and then increasing. Does this mean that the effect and role of step 2 and are not significant?

Overall Performance

Step1 Effect

RSL-SQL> python .\evaluation\evaluation_changed.py --predicted_sql_path ./src/sql_log/preliminary_sql.txt --ground_truth_path ./data/dev.sql --data_mode dev --db_root_path ./database/dev_databases/ --diff_json_path ./data/dev.json --num_cpus 8
start calculate
                     simple               moderate             challenging          total
count                925                  464                  145                  1534
======================================    ACCURACY    =====================================
accuracy             69.95                52.59                50.34                **62.84**
===========================================================================================
Finished evaluation

Step2 Effect

RSL-SQL> python .\evaluation\evaluation_changed.py --predicted_sql_path .\src\sql_log\step_2_information_augmentation.txt --ground_truth_path .\data\dev.sql --data_mode dev --db_root_path .\database\dev_databases\ --num_cpus 8 --diff_json_path .\data\dev.json
start calculate
                     simple               moderate             challenging          total
count                925                  464                  145                  1534
======================================    ACCURACY    =====================================
accuracy             69.19                52.37                54.48                **62.71**
===========================================================================================
Finished evaluation

Step3 Effect

RSL-SQL> python .\evaluation\evaluation_changed.py --predicted_sql_path .\src\sql_log\step_3_binary.txt --ground_truth_path .\data\dev.sql --data_mode dev --db_root_path .\database\dev_databases\ --num_cpus 8 --diff_json_path .\data\dev.json
start calculate
                     simple               moderate             challenging          total
count                925                  464                  145                  1534
======================================    ACCURACY    =====================================
accuracy             72.32                56.03                57.93                66.04
===========================================================================================
Finished evaluation                                    

Step4 Final Effect

RSL-SQL> python .\evaluation\evaluation_changed.py --predicted_sql_path .\src\sql_log\final_sql.txt --ground_truth_path ./data/dev.sql --data_mode dev --db_root_path ./database/dev_databases/ --diff_json_path ./data/dev.json --num_cpus 8    
start calculate
                     simple               moderate             challenging          total
count                925                  464                  145                  1534
======================================    ACCURACY    =====================================
accuracy             72.76                57.11                58.62                66.69
===========================================================================================
Finished evaluation

As the results show, the overall accuracy of step 2 (62.71) is slightly lower than that of step 1 (62.84), contrary to the improvement reported in the paper's ablation experiments. Moreover, compared with the overall score of 67.21 in the paper, the result I obtained with GPT was only 66.69, a difference of about 0.5 points.
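The arithmetic behind these comparisons can be checked with a short script. It is only a sketch: it recomputes each step's total as the count-weighted average of the per-difficulty accuracies printed above, so the recomputed totals can differ from the reported ones in the last digit because the inputs are already rounded. The variable names are illustrative.

```python
# Question counts per difficulty, as printed by the evaluation script above.
counts = {"simple": 925, "moderate": 464, "challenging": 145}

# Per-difficulty accuracies reported above for each step.
steps = {
    "step1": {"simple": 69.95, "moderate": 52.59, "challenging": 50.34},
    "step2": {"simple": 69.19, "moderate": 52.37, "challenging": 54.48},
    "step3": {"simple": 72.32, "moderate": 56.03, "challenging": 57.93},
    "step4": {"simple": 72.76, "moderate": 57.11, "challenging": 58.62},
}


def weighted_total(accuracy_by_difficulty):
    """Count-weighted average of the per-difficulty accuracies."""
    weighted_sum = sum(accuracy_by_difficulty[k] * counts[k] for k in counts)
    return weighted_sum / sum(counts.values())


totals = {name: weighted_total(acc) for name, acc in steps.items()}
for name, total in totals.items():
    print(f"{name}: {total:.2f}")

# Change in total accuracy from each step to the next.
names = list(steps)
deltas = {
    f"{a}->{b}": totals[b] - totals[a] for a, b in zip(names, names[1:])
}
for transition, delta in deltas.items():
    print(f"{transition}: {delta:+.2f}")

# Gap between the paper's reported score and the reproduced final score.
paper_score = 67.21
reproduced_score = 66.69
gap = paper_score - reproduced_score
print(f"gap to paper: {gap:.2f}")
```

The per-difficulty rows also show that the small drop at step 2 comes from the simple and moderate buckets, while the challenging bucket improves from 50.34 to 54.48.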

I conducted three separate experiments, and the resulting scores were almost identical. Therefore, I have to question the effectiveness of step 2 and the reproducibility of the paper's results.

@Laqcce-cao
Owner

log.zip

My experimental data is here for you to test.

I'd like to know whether you made any changes to my code when reproducing it, for example to how the prompts are organized. If possible, you could reproduce it with DeepSeek, which is relatively cheap.

@Laqcce-cao
Owner

It's worth noting that, depending on the LLM, Step 2 may not always outperform Step 1. This is precisely the potential risk of schema linking that we refer to.

However, Step 3 relies on both Step 1 and Step 2. The SQL generated in Step 2 greatly helps improve the performance of Step 3.

The SQL generated in Steps 1 and 2 each has its own advantages, and a significant portion of their correct and incorrect cases do not overlap. Step 3 hedges the risks between the two.
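One way to check this non-overlap on your own runs is to compare per-question correctness between the two steps. The sketch below uses made-up correctness flags, not data from the logs, and the "oracle" figure is only an upper bound for an idealized selector that always picks the correct candidate when one exists; it is not a description of how Step 3 actually works.

```python
def overlap_report(step1_correct, step2_correct):
    """Summarize how two per-question correctness lists overlap."""
    if len(step1_correct) != len(step2_correct):
        raise ValueError("both lists must cover the same questions")
    pairs = list(zip(step1_correct, step2_correct))
    n = len(pairs)
    both = sum(a and b for a, b in pairs)
    only_step1 = sum(a and not b for a, b in pairs)
    only_step2 = sum(b and not a for a, b in pairs)
    neither = n - both - only_step1 - only_step2
    return {
        "both": both,
        "only_step1": only_step1,
        "only_step2": only_step2,
        "neither": neither,
        "step1_accuracy": 100.0 * (both + only_step1) / n,
        "step2_accuracy": 100.0 * (both + only_step2) / n,
        # Upper bound: a question counts if either step got it right.
        "oracle_accuracy": 100.0 * (both + only_step1 + only_step2) / n,
    }


# Hypothetical flags for eight questions (True = executed correctly).
step1_correct = [True, True, False, False, True, False, True, False]
step2_correct = [True, False, True, False, True, True, False, False]

report = overlap_report(step1_correct, step2_correct)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this toy example both steps score 50%, yet the upper bound is 75% because each step solves two questions the other misses. Two steps with nearly identical totals can therefore still be complementary.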

Step 2 is not yet perfect, and we are currently exploring ways to further enhance its robustness.

Our other experiments (which may not be included in the paper) show that, within the RSL-SQL framework, a performance improvement in either Step 1 or Step 2 contributes to boosting the final performance.
