Request for the negative datasets #1

Open
zw-SIMM opened this issue Oct 19, 2024 · 4 comments

zw-SIMM commented Oct 19, 2024

Great work!
However, I'm a bit confused about the negative samples in your work,
even with the negative-sample preparation code provided.
Is the ratio of positive to negative samples set at 1:1000 for both training and testing?
Could you also provide the negative datasets as a benchmark, for reproducibility and fair comparison?

Negative Sample. A common method involves designating all enzymes within a training set that are not annotated for catalyzing a specific reaction as negative samples [51]. Nevertheless, given the extensive size of our dataset, we opt for a strategy centered on enzyme and reaction similarity to construct negative samples. Specifically, for each verified positive enzyme-reaction pair, we identify the top-k enzymes that closely resemble the positive enzyme but do not have annotations for catalyzing the reaction, using them as negative samples. Similarly, we select the top-k reactions that are similar to the positive reaction but are not catalyzed by the positive enzyme, to serve as additional negative samples (k=1000). This method effectively narrows down the size of negative samples while retaining those of significance for both training and testing purposes. Despite our approach, the construction of negative samples still presents an unresolved challenge, remaining as an open question for future development.
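For anyone else trying to rebuild these negatives, the enzyme side of the paragraph above can be sketched roughly as follows. This is only a minimal illustration, not the repository's prepare_negative.py: the function name, the integer enzyme indices, and the precomputed `similarity` matrix are all assumptions, and the paragraph does not say which similarity measure is used.

```python
def topk_similar_negatives(similarity, positives, k):
    """For each positive (enzyme, reaction) pair, pick up to k enzymes that are
    most similar to the positive enzyme but are not annotated for the reaction.

    similarity: square nested list, similarity[i][j] between enzymes i and j
                (hypothetical input; how it is computed is an assumption).
    positives:  list of (enzyme_index, reaction_id) pairs.
    """
    # reaction -> set of enzymes annotated as catalyzing it
    annotated = {}
    for enzyme, reaction in positives:
        annotated.setdefault(reaction, set()).add(enzyme)

    negatives = {}
    for enzyme, reaction in positives:
        # Candidates: every other enzyme with no annotation for this reaction.
        candidates = [c for c in range(len(similarity))
                      if c != enzyme and c not in annotated[reaction]]
        # Most similar to the positive enzyme first.
        candidates.sort(key=lambda c: similarity[enzyme][c], reverse=True)
        negatives[(enzyme, reaction)] = candidates[:k]
    return negatives
```

The reaction side would presumably be symmetric: swap the roles so that `similarity` compares reactions and the annotation map goes from enzyme to reactions. With k=1000 per side this would give roughly 1000 enzyme-based and 1000 reaction-based negatives per positive pair.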
Owner

WillHua127 commented Oct 19, 2024

Thanks for your interest. The negative dataset is larger than 10 GB, which is why we chose not to upload it; it is simply too big. You can create your own negative samples using mutations, by treating unseen enzyme-reaction pairs as negative samples, or by using homology alignments.
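As a rough illustration of the second option (unseen enzyme-reaction pairs as negatives), something like the sketch below could work. It is an assumption-laden example rather than the script used for the paper: the function name, the `per_positive` parameter, and the uniform random sampling are all hypothetical choices.

```python
import random

def sample_unseen_negatives(positives, enzymes, per_positive, seed=0):
    """Draw negative (enzyme, reaction) pairs that never appear as positives.

    positives:    list of annotated (enzyme, reaction) pairs.
    enzymes:      list of all enzyme identifiers.
    per_positive: number of negatives to draw per positive pair
                  (e.g. 1000 to mimic a 1:1000 ratio).
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    seen = set(positives)
    negatives = []
    for _, reaction in positives:
        # Enzymes with no annotation for this reaction are candidate negatives.
        pool = [e for e in enzymes if (e, reaction) not in seen]
        chosen = rng.sample(pool, min(per_positive, len(pool)))
        negatives.extend((e, reaction) for e in chosen)
    return negatives
```

Note that a reaction with several positive enzymes is visited once per positive, so the returned list can contain repeated pairs; deduplicate with `set(...)` if that matters for your setup.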

Author

zw-SIMM commented Oct 20, 2024

Thanks for your reply. I understand that the negative dataset is large (>10 GB) and that uploading it may not be feasible. However, to better replicate your results and ensure alignment with your experimental settings, I would like to confirm a few points:

Positive-to-Negative Sample Ratio:
Could you confirm whether the ratio of positive to negative samples is 1:1000 or 1:2000? Specifically, does each sequence or molecule have 1000 negative samples?

Negative Sample Generation Script:
While I see that prepare_negative.py generates dictionaries similar to sequence or molecule data, it isn't clear how to directly generate the complete negative samples used in your experiments. Could you provide the full script or detailed instructions for this step?

Thank you again for your excellent work and support. I look forward to your further guidance!

@WillHua127
Owner

You don't need the exact negative samples to reproduce our results, because our evaluation is retrieval-based, i.e., it uses only positive samples. If you want to match the ratio, it is 1:1000 for both sequence and molecule.
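To make "retrieval-based" concrete: an evaluation of this kind only needs model scores over a candidate pool plus the index of the annotated positive for each query. The sketch below is a generic example under that assumption; the function name and the choice of top-k accuracy and mean reciprocal rank are illustrative, since the exact metrics are not stated in this thread.

```python
def retrieval_metrics(scores, true_index, k):
    """Top-k accuracy and mean reciprocal rank from positives only.

    scores[q][c]:  model score of candidate c for query q (higher is better).
    true_index[q]: index of the annotated (positive) candidate for query q.
    """
    hits = 0
    reciprocal_ranks = 0.0
    for q, row in enumerate(scores):
        target = row[true_index[q]]
        # Rank = 1 + number of candidates scored strictly higher than the positive.
        rank = 1 + sum(1 for s in row if s > target)
        hits += rank <= k
        reciprocal_ranks += 1.0 / rank
    n = len(scores)
    return hits / n, reciprocal_ranks / n
```

No sampled negatives enter this computation; every non-annotated candidate in the pool acts as an implicit distractor, which is why the exact training negatives are not needed to reproduce retrieval numbers.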

Author

zw-SIMM commented Oct 25, 2024

Thanks for your reply again!
