Thank you for open-sourcing the code for your paper "Graph Neural Network Based Collaborative Filtering for API Usage Recommendation"! Comparing the Methodology section of the paper against the implementation, I have some questions I'd appreciate clarification on:
Potential Data Leakage
In main.py/train(), the load_data function builds the adj_matrix from the complete dataset (including the test set) and then trains on that matrix. Could this leak test-set information into the training process?
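To make the concern concrete, here is a minimal sketch of the leakage-free construction I would have expected, building the interaction graph from training pairs only (a hypothetical sketch, not code from this repo; the names `train_dict`, `nb_methods`, and `nb_apis` are my assumptions, loosely following split_data below):

```python
# Hypothetical sketch: build the bipartite method-API adjacency from
# *training* interactions only, so no test invocation enters the graph.
import numpy as np
import scipy.sparse as sp

def build_adj_matrix(train_dict, nb_methods, nb_apis):
    rows, cols = [], []
    for uid, api_ids in train_dict.items():  # uid: method id (my assumption)
        for aid in api_ids:                  # aid: invoked API id
            rows.append(uid)
            cols.append(aid)
    data = np.ones(len(rows), dtype=np.float32)
    r = sp.csr_matrix((data, (rows, cols)), shape=(nb_methods, nb_apis))
    # Symmetric bipartite adjacency over methods and APIs.
    return sp.bmat([[None, r], [r.T, None]], format='csr')
```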
Dataset Split Strategy
The paper's Section IV-C states: "Specifically, we consider the developer is working at a project p, and the methods of p are split into training set, validation set and test set."
However, the current implementation:
1. First splits projects into train_projects and test_projects (80:20)
2. Then splits methods within test_projects into train and test (80:20)
Could you elaborate on the rationale behind this different splitting strategy?
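For comparison, here is how I read the within-project split described in Section IV-C (a hypothetical sketch, not the repo's code; the 80/10/10 ratios and the reuse of proj_have_users are my assumptions):

```python
# Hypothetical sketch of a per-project method split, as I read Section IV-C:
# every project contributes methods to train, validation, and test.
import numpy as np

def split_within_project(proj_have_users, seed=0):
    rng = np.random.default_rng(seed)
    train, val, test = {}, {}, {}
    for pid, method_ids in enumerate(proj_have_users):
        ids = rng.permutation(method_ids)
        n = len(ids)
        train[pid] = ids[: int(n * 0.8)]            # 80% train (assumed ratio)
        val[pid] = ids[int(n * 0.8): int(n * 0.9)]  # 10% validation
        test[pid] = ids[int(n * 0.9):]              # 10% test
    return train, val, test
```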
Training Data Construction
The paper mentions: "For each method q in the test set, we keep the first π API invocations as visible context and the rest invocations are taken as the ground truth GT (q)."
However, the implementation also uses the first 4 invocations of each test method as training data. Is this treatment sound? Additionally, I'm curious how the visible context is utilized during the evaluation phase.
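To make the second question concrete, here is how I would naively expect the visible context to enter evaluation (a hypothetical sketch; `model.score` and the metric are my assumptions, and `pi=4` mirrors the "retain 4 invocations" branch in the snippet below):

```python
# Hypothetical evaluation sketch: the first pi invocations are the visible
# context, the remainder is the ground truth GT(q).
import numpy as np

def evaluate_method(model, invocations, pi=4, k=10):
    context = invocations[:pi]              # visible context, per the paper
    ground_truth = set(invocations[pi:])    # GT(q): held-out invocations
    scores = model.score(context)           # assumed API: rank all APIs given context
    topk = list(np.argsort(scores)[::-1][:k])
    hits = ground_truth.intersection(topk)
    return len(hits) / min(k, len(ground_truth))  # a recall@k-style metric
```

If instead the context is folded into the training graph, as add_to_test(gt_users, test_cnt, 4) in the snippet below seems to do, I'd appreciate an explanation of how that matches the quoted setup.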
Relevant code snippet:
```python
def split_data(self, conf):
    """conf: C1.1 C1.2"""
    self.config = (int(conf[1]), int(conf[3]))
    np.random.seed(0)
    test_proj_id = set(np.random.choice(range(self.nb_proj),
                                        int(self.nb_proj * 0.2), replace=False))
    ...
    for pid in test_proj_id:
        size = len(self.proj_have_users[pid])
        # print('test pid and user size', pid, size)
        if self.config[0] == 1:  # remove half user methods
            user_id = self.proj_have_users[pid][: size // 2]
        elif self.config[0] == 2:  # keep all user methods
            user_id = self.proj_have_users[pid]
        if self.config[1] == 2:  # retain 4 invocations
            # use 0.2 percent methods per project as active methods for test
            gt_users = get_test_user(user_id, 5)  # users having more than 5 invocations
            test_cnt = len(gt_users) - int(len(gt_users) * 0.8)
            add_to_test(gt_users, test_cnt, 4)
        if self.config[1] == 1:  # reserve the first invocation
            gt_users = get_test_user(user_id, 4)
            test_cnt = len(user_id) - int(len(user_id) * 0.8)
            add_to_test(gt_users, test_cnt, 1)
    ...
    for pid in range(self.nb_proj):
        if pid in test_proj_id:
            continue
        for uid in self.proj_have_users[pid]:
            self.train_dict[uid] = self.invocation_mx[uid]
```
I would greatly appreciate your insights on these implementation details.