Add features based on file paths in the title and description #4270
Metrics of the newly trained model: metrics.log
Do you see significant improvement when adding this feature?
I've previously attached the metrics of the model here:
Here are the metrics of the original/current model: metrics_original.log
There is a slight improvement (about +1%) in each of the metrics.
Looks good in general, but could you add a few tests for the new class?
I've converted this PR to a draft, as I realized the extraction of file paths still needs some polishing. For example, there are cases where it may mistake a URL or a numbered step (e.g. "1.Step 1", "2.Step2") for a file path. Once that is done, I'll be sure to add a few tests for this feature!
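To make the two false positives mentioned above concrete, here is a minimal sketch of a token filter. It is not the code from this PR: the function name `is_plausible_file_path` and both regular expressions are hypothetical, chosen only to illustrate one possible heuristic.

```python
import re

# Hypothetical patterns, for illustration only (not taken from the PR).
URL_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")  # scheme://...
NUMBERED_STEP_RE = re.compile(r"^\d+\.")  # "1.Step", "2.Step2", ...
EXTENSION_RE = re.compile(r"\.[A-Za-z0-9]+$")  # trailing ".ext"


def is_plausible_file_path(token: str) -> bool:
    """Heuristically decide whether a whitespace-delimited token is a file path."""
    # Reject URLs and numbered steps first.
    if URL_RE.match(token) or NUMBERED_STEP_RE.match(token):
        return False
    # Otherwise require a directory separator or a file extension.
    return "/" in token or EXTENSION_RE.search(token) is not None
```

A heuristic like this trades recall for precision: a real file named `1.txt` would be rejected by the numbered-step rule, so the patterns would need tuning against actual bug reports.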
Current metrics: metrics.log Seems to perform slightly worse than the current model and It is worth noting that the first instance of the file path feature model correctly classified the above bug as
The current model now classifies
Added two tests here: a5a9c0f
It seems the tests failed; I'll revise them ASAP.
What is the difference in average precision / recall? Is there any component which gets much better or much worse?
Here are the metrics from the model with the
Here are the metrics from the currently deployed model (which does not include the
For the 0.9 CF, the precision increased by 0.02 and the recall increased by 0.01. Overall, most metrics increased for specific product-component pairs; feel free to consult the detailed metrics for the few cases where either the precision or the recall dropped with the new model.
Given your latest changes, was there any effect on the metrics?
bugbug/bug_features.py
)

psl = PublicSuffixList()
tlds = set()
Suggested change:
- tlds = set()
+ tlds = set(f".{entry}" for entry in psl.tlds if "." not in entry)
Changed here: 75d1a23
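The effect of the suggested one-liner can be sketched without the suffix-list dependency. In the example below, `build_tld_set` and `sample_entries` are hypothetical names, and the list is an illustrative stand-in for `psl.tlds`; only the filtering logic mirrors the suggestion.

```python
def build_tld_set(entries: list[str]) -> set[str]:
    """Keep single-label suffixes only and prefix each with a dot."""
    # Entries containing a dot (e.g. "co.uk") are multi-label suffixes
    # and are skipped, as in the suggested change.
    return {f".{entry}" for entry in entries if "." not in entry}


# Illustrative stand-in for `psl.tlds`; a real suffix list is far longer.
sample_entries = ["com", "org", "co.uk", "io"]
tlds = build_tld_set(sample_entries)
```

Building the set once up front means later membership checks such as `".com" in tlds` run in constant time instead of scanning the list for every token.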
bugbug/bug_features.py
def remove_urls(self, text: str) -> str:
    for keyword in self.non_file_path_keywords:
        if keyword in text:
            text = re.sub(r"\S*" + re.escape(keyword) + r"\S*", "", text)
Here and elsewhere, it'd be ideal to use f-strings instead of string addition
Changed here: 75d1a23
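For reference, the f-string form the reviewer asked for might look like the sketch below. It is written as a standalone function so it can run on its own; the name `remove_tokens_containing` and the `keywords` parameter are assumptions standing in for the method and `self.non_file_path_keywords`, and this is not necessarily what commit 75d1a23 contains.

```python
import re


def remove_tokens_containing(text: str, keywords: list[str]) -> str:
    """Drop every whitespace-delimited token that contains one of the keywords."""
    for keyword in keywords:
        if keyword in text:
            # The rf"" prefix combines a raw string (so \S reaches the regex
            # engine unchanged) with f-string interpolation of the escaped keyword.
            text = re.sub(rf"\S*{re.escape(keyword)}\S*", "", text)
    return text
```

Calling `remove_tokens_containing("see https://example.com/a and dom/base/file.cpp", ["://"])` removes the URL token and leaves the path in place; the whitespace around the removed token is not collapsed.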
tests/test_bug_features.py
def test_FilePaths(read):
    read(
        "file_paths.json",
Given the fixture is small enough, you could just write the text here directly and remove the fixture. This will make it easier to follow the test.
Done here: 301783c
Training the model with the file path feature included and excluded, I got the following results:
Overall, there seems to be a marginal increase in precision and recall when the file path feature is included.
7eaab61 to 2022eb4
Resolves #4269.
Introduces a new feature that extracts file paths mentioned in the title and description of a bug and splits them into sub-paths and individual directories/files.
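As a rough illustration of the splitting described above, the sketch below expands one path into its cumulative sub-paths and its individual components. The helper name `split_file_path` is hypothetical, and the PR's actual implementation may differ (for example in ordering or in how it handles separators other than `/`).

```python
from itertools import accumulate


def split_file_path(path: str) -> list[str]:
    """Return the cumulative sub-paths of `path` followed by its components."""
    # Ignore empty segments produced by leading, trailing, or doubled slashes.
    parts = [part for part in path.split("/") if part]
    # Cumulative prefixes: "a", "a/b", "a/b/c.py".
    sub_paths = list(accumulate(parts, lambda prefix, part: f"{prefix}/{part}"))
    # Merge both lists, keeping first-seen order and dropping duplicates
    # (the first component is also the first sub-path).
    return list(dict.fromkeys(sub_paths + parts))
```

For `"bugbug/bug_features.py"` this yields the directory `bugbug`, the full path, and the file name `bug_features.py`, so a classifier can pick up signal from any level of the path.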