
Questions about Dataset Split and Training Implementation #1

Open
ZXXYy opened this issue Dec 13, 2024 · 0 comments

ZXXYy commented Dec 13, 2024

Thank you for open-sourcing the code for your paper "Graph Neural Network Based Collaborative Filtering for API Usage Recommendation"! I have some questions regarding the implementation details that I'd appreciate clarification on:

  1. Potential Data Leakage
    In main.py/train(), the load_data function constructs the adj_matrix from the complete dataset (including the test set), and training is then performed on this matrix. Could this implementation leak test-set information into the training process? (See the sketch after this list for the train-only construction I had expected.)
  2. Dataset Split Strategy
    The paper's Section IV-C states: "Specifically, we consider the developer is working at a project p, and the methods of p are split into training set, validation set and test set."
    However, the current implementation:

  • First splits projects into train_projects and test_projects (80:20)
  • Then splits methods within test_projects into train and test (80:20)
    Could you elaborate on the rationale behind this different splitting strategy?
  3. Training Data Construction
    The paper mentions: "For each method q in the test set, we keep the first π API invocations as visible context and the rest invocations are taken as the ground truth GT (q)."
    However, the implementation also uses the first 4 invocations of the test methods as training data. Additionally, I'm curious how the visible context is utilized during the evaluation phase (see the sketch after the code snippet below for the protocol I expected).
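
For point 1, here is a minimal sketch of what I expected instead. The helper name build_adj_from_train and the arguments nb_users / nb_apis are placeholders of mine, not names from the repo; the idea is simply that the adjacency matrix is assembled only from training invocations, so nothing from the test methods enters message passing.

import numpy as np
import scipy.sparse as sp

def build_adj_from_train(train_dict, nb_users, nb_apis):
    """Sketch: build the method-API interaction matrix R and the bipartite
    adjacency A = [[0, R], [R^T, 0]] from *training* invocations only,
    so that test-set invocations never enter message passing.

    Assumes train_dict maps a method id to its invocation row over all APIs,
    as in `self.train_dict[uid] = self.invocation_mx[uid]` below."""
    R = sp.lil_matrix((nb_users, nb_apis), dtype=np.float32)
    for uid, row in train_dict.items():            # training methods only
        R[uid, :] = (np.asarray(row) > 0).astype(np.float32)
    # bipartite adjacency over (methods + APIs); no test interactions included
    return sp.bmat([[None, R], [R.T, None]], format='csr')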

Relevant code snippet:

def split_data(self, conf):
    """conf: C1.1 C1.2"""
    self.config = (int(conf[1]), int(conf[3]))
    np.random.seed(0)
    test_proj_id = set(np.random.choice(range(self.nb_proj),
                                        int(self.nb_proj*0.2), replace=False))
    ...
    for pid in test_proj_id:
        size = len(self.proj_have_users[pid])
        # print('test pid and user size', pid, size)
        if self.config[0] == 1:  # remove half user methods
            user_id = self.proj_have_users[pid][: size//2]
        elif self.config[0] == 2:  # keep all user methods
            user_id = self.proj_have_users[pid]
        if self.config[1] == 2:  # retain 4 invocations
            # use 0.2 percent methods per project as active methods for test
            gt_users = get_test_user(user_id, 5)  # users having more than 5 invocations
            test_cnt = len(gt_users) - int(len(gt_users)*0.8)
            add_to_test(gt_users, test_cnt, 4)
        if self.config[1] == 1:  # reserve the first invocation
            gt_users = get_test_user(user_id, 4)
            test_cnt = len(user_id) - int(len(user_id)*0.8)
            add_to_test(gt_users, test_cnt, 1)
    ...
    for pid in range(self.nb_proj):
        if pid in test_proj_id:
            continue
        for uid in self.proj_have_users[pid]:
            self.train_dict[uid] = self.invocation_mx[uid]
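
For point 3, this is roughly the evaluation protocol I understood from Section IV-C, written out as a sketch. The names invocation_seqs and model.score_apis are placeholders of mine, not from the repo: the first π invocations of a test method are the visible context, the remaining invocations form GT(q), and recall@k is measured against GT(q) only.

import numpy as np

def evaluate_with_context(model, test_methods, invocation_seqs, pi=4, k=10):
    """Sketch of the protocol described in the paper: the first `pi`
    invocations of each test method q are the visible context, the rest
    form GT(q), and recall@k is computed against GT(q) only.

    `invocation_seqs[q]` is assumed to be the ordered list of API ids
    invoked by method q; `model.score_apis` is a hypothetical scoring call
    returning one score per candidate API."""
    recalls = []
    for q in test_methods:
        seq = invocation_seqs[q]
        context, gt = seq[:pi], set(seq[pi:])
        if not gt:                                  # nothing left to predict
            continue
        scores = model.score_apis(q, context)       # scores over all candidate APIs
        ctx = set(context)
        # rank only APIs outside the visible context
        ranked = [a for a in np.argsort(-scores) if a not in ctx]
        recalls.append(len(gt & set(ranked[:k])) / len(gt))
    return float(np.mean(recalls)) if recalls else 0.0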

I would greatly appreciate your insights on these implementation details.

