
Questions about Dataset Split and Training Implementation #1

Open
ZXXYy opened this issue Dec 13, 2024 · 0 comments

ZXXYy commented Dec 13, 2024

Thank you for open-sourcing the code for your paper "Graph Neural Network Based Collaborative Filtering for API Usage Recommendation"! I have some questions regarding the implementation details that I'd appreciate clarification on:

  1. Potential Data Leakage
    In main.py/train(), the load_data function constructs the adj_matrix from the complete dataset (including the test set), and training is then performed on this matrix. Could this implementation leak test-set information into the training process? (See the sketch after this list for the train-only construction I had expected.)
  2. Dataset Split Strategy
    The paper's Section IV-C states: "Specifically, we consider the developer is working at a project p, and the methods of p are split into training set, validation set and test set."
    However, the current implementation:

  • First splits projects into train_projects and test_projects (80:20)
  • Then splits methods within test_projects into train and test (80:20)
    Could you elaborate on the rationale behind this different splitting strategy?
  3. Training Data Construction
    The paper mentions: "For each method q in the test set, we keep the first π API invocations as visible context and the rest invocations are taken as the ground truth GT (q)."
    However, the implementation also uses the first 4 invocations of the test methods as training data. Additionally, I'm curious how the visible context is utilized during the evaluation phase (see the sketch after the code snippet below for the protocol I expected).
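
For point 1, here is a minimal sketch of what I expected instead. The helper name build_adj_from_train and the arguments nb_users / nb_apis are placeholders of mine, not names from the repo; the idea is simply that the adjacency matrix is assembled only from training invocations, so nothing from the test methods enters message passing.

import numpy as np
import scipy.sparse as sp

def build_adj_from_train(train_dict, nb_users, nb_apis):
    """Sketch: build the method-API interaction matrix R and the bipartite
    adjacency A = [[0, R], [R^T, 0]] from *training* invocations only,
    so that test-set invocations never enter message passing.

    Assumes train_dict maps a method id to its invocation row over all APIs,
    as in `self.train_dict[uid] = self.invocation_mx[uid]` below."""
    R = sp.lil_matrix((nb_users, nb_apis), dtype=np.float32)
    for uid, row in train_dict.items():            # training methods only
        R[uid, :] = (np.asarray(row) > 0).astype(np.float32)
    # bipartite adjacency over (methods + APIs); no test interactions included
    return sp.bmat([[None, R], [R.T, None]], format='csr')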

Relevant code snippet:

def split_data(self, conf):
    """conf: C1.1 C1.2"""
    self.config = (int(conf[1]), int(conf[3]))
    np.random.seed(0)
    test_proj_id = set(np.random.choice(range(self.nb_proj),
                                        int(self.nb_proj*0.2), replace=False))
    ...
    for pid in test_proj_id:
        size = len(self.proj_have_users[pid])
        # print('test pid and user size', pid, size)
        if self.config[0] == 1:  # remove half user methods
            user_id = self.proj_have_users[pid][: size//2]
        elif self.config[0] == 2:  # keep all user methods
            user_id = self.proj_have_users[pid]
        if self.config[1] == 2:  # retain 4 invocations
            # use 0.2 percent methods per project as active methods for test
            gt_users = get_test_user(user_id, 5)  # users having more than 5 invocations
            test_cnt = len(gt_users) - int(len(gt_users)*0.8)
            add_to_test(gt_users, test_cnt, 4)
        if self.config[1] == 1:  # reserve the first invocation
            gt_users = get_test_user(user_id, 4)
            test_cnt = len(user_id) - int(len(user_id)*0.8)
            add_to_test(gt_users, test_cnt, 1)
    ...
    for pid in range(self.nb_proj):
        if pid in test_proj_id:
            continue
        for uid in self.proj_have_users[pid]:
            self.train_dict[uid] = self.invocation_mx[uid]
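
For point 3, this is roughly the evaluation protocol I understood from Section IV-C, written out as a sketch. The names invocation_seqs and model.score_apis are placeholders of mine, not from the repo: the first π invocations of a test method are the visible context, the remaining invocations form GT(q), and recall@k is measured against GT(q) only.

import numpy as np

def evaluate_with_context(model, test_methods, invocation_seqs, pi=4, k=10):
    """Sketch of the protocol described in the paper: the first `pi`
    invocations of each test method q are the visible context, the rest
    form GT(q), and recall@k is computed against GT(q) only.

    `invocation_seqs[q]` is assumed to be the ordered list of API ids
    invoked by method q; `model.score_apis` is a hypothetical scoring call
    returning one score per candidate API."""
    recalls = []
    for q in test_methods:
        seq = invocation_seqs[q]
        context, gt = seq[:pi], set(seq[pi:])
        if not gt:                                  # nothing left to predict
            continue
        scores = model.score_apis(q, context)       # scores over all candidate APIs
        ctx = set(context)
        # rank only APIs outside the visible context
        ranked = [a for a in np.argsort(-scores) if a not in ctx]
        recalls.append(len(gt & set(ranked[:k])) / len(gt))
    return float(np.mean(recalls)) if recalls else 0.0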

I would greatly appreciate your insights on these implementation details.

