Add features based on file paths in the title and description #4270

Open
wants to merge 41 commits into
base: master
from

Conversation

benjaminmah
Contributor

Resolves #4269.

Introduces a new feature that extracts file paths mentioned in the title and description of a bug and splits each one into sub-paths and individual directories/files.
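
As a rough illustration of the splitting step (a sketch, not the code in this PR), a path can be expanded into its cumulative sub-paths and its individual components. The function name `split_path` and the sample path are hypothetical.

```python
def split_path(path: str) -> dict[str, list[str]]:
    """Expand a slash-separated path into cumulative sub-paths and single components."""
    # Drop empty pieces produced by leading, trailing, or doubled slashes.
    parts = [p for p in path.split("/") if p]
    # Cumulative prefixes: "a", "a/b", "a/b/c", ...
    sub_paths = ["/".join(parts[: i + 1]) for i in range(len(parts))]
    return {"sub_paths": sub_paths, "components": parts}


# Hypothetical input, chosen only to show the shape of the output.
result = split_path("netwerk/protocol/http/nsHttpChannel.cpp")
```

Emitting both the prefixes and the single components lets a model pick up on a shared directory even when two bugs mention different files under it.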

@benjaminmah
Contributor Author

Metrics of the newly trained model: metrics.log

@benjaminmah benjaminmah marked this pull request as ready for review June 24, 2024 20:35
Member

@suhaibmujahid suhaibmujahid left a comment

Do you see significant improvement when adding this feature?

@benjaminmah
Contributor Author

Do you see significant improvement when adding this feature?

I've previously attached the metrics of the model here:

Metrics of the newly trained model: metrics.log

Here are the metrics of the original/current model: metrics_original.log

There is a slight improvement (~ +1%) in each of the metrics.

@benjaminmah benjaminmah requested a review from suhaibmujahid July 8, 2024 13:52
@benjaminmah benjaminmah requested a review from marco-c July 17, 2024 18:43
Collaborator

@marco-c marco-c left a comment

Looks good in general, but could you add a few tests for the new class?

@benjaminmah benjaminmah marked this pull request as draft July 18, 2024 20:32
@benjaminmah
Contributor Author

benjaminmah commented Jul 18, 2024

I've converted this PR to a draft, as I realized the extraction of file paths still needs some polishing. For example, there are cases where it may mistake a URL or a numbered step (e.g. 1.Step 1, 2.Step2) for a file path. Once that's done, I'll be sure to add a few tests for this feature!
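
To make the two failure modes concrete, here is a minimal sketch of a token filter that rejects URLs and numbered steps before treating a token as a path. The regexes and the name `looks_like_file_path` are assumptions for illustration, not the rules used in this PR.

```python
import re

# Assumed heuristics, not the PR's actual rules.
URL_RE = re.compile(r"^(?:[a-z][a-z0-9+.-]*://|www\.)", re.IGNORECASE)
# A leading number, a dot, then a non-digit: "1.Step", "2.Step2".
STEP_RE = re.compile(r"^\d+\.(?!\d)")
# A short extension that starts with a letter: ".cpp", ".js", ".h".
EXT_RE = re.compile(r"\.[A-Za-z][A-Za-z0-9]{0,4}$")


def looks_like_file_path(token: str) -> bool:
    """Return True if a whitespace-delimited token plausibly names a file or directory."""
    if URL_RE.match(token) or STEP_RE.match(token):
        return False
    # Require a directory separator or a file extension.
    return "/" in token or EXT_RE.search(token) is not None
```

Checking the URL and step patterns first matters because both kinds of token contain dots or slashes and would otherwise pass the final test.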

@benjaminmah
Contributor Author

Current metrics: metrics.log

This version seems to perform slightly worse than the current model, and python3 -m scripts.bug_classifier component --bug-id 1902245 classifies the bug as Core::Widget: Gtk (which is incorrect).

It is worth noting that the first version of the model with the file path feature correctly classified the above bug as Core::Networking, even though it did not retrieve all of the relevant file paths from the bug summary and description correctly. Will continue to look into this.

@benjaminmah
Contributor Author

With the current model, python3 -m scripts.bug_classifier component --bug-id 1902245 now correctly classifies the bug as Core::Networking. The metrics can be found here: metrics.log.

@benjaminmah
Contributor Author

Looks good in general, but could you add a few tests for the new class?

Added two tests here: a5a9c0f

@benjaminmah
Contributor Author

It seems the tests failed; I'll revise them ASAP.

@benjaminmah benjaminmah marked this pull request as ready for review July 29, 2024 14:06
@benjaminmah benjaminmah requested a review from marco-c August 1, 2024 14:27
@marco-c
Collaborator

marco-c commented Aug 1, 2024

What is the difference in average precision / recall? Is there any component which gets much better or much worse?

@benjaminmah
Contributor Author

What is the difference in average precision / recall? Is there any component which gets much better or much worse?

Here are the metrics from the model with the FilePaths feature included: new_model.log

Here are the metrics from the currently deployed model (which does not include the FilePaths feature): old_model.log

At the 0.9 confidence threshold, precision increased by 0.02 and recall by 0.01.

Overall, most metrics seem to increase for individual product-component pairs. Feel free to consult the detailed metrics for the few cases where either precision or recall dropped with the new model.

Collaborator

@marco-c marco-c left a comment

Given your latest changes, was there any effect on the metrics?

)

psl = PublicSuffixList()
tlds = set()
Collaborator

Suggested change
tlds = set()
tlds = set(f".{entry}" for entry in psl.tlds if "." not in entry)
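
For readers unfamiliar with the suggestion: it keeps only the suffix entries that contain no dot and prefixes each with one. A minimal sketch of that filter, using a small hand-written list as a stand-in for `psl.tlds` (the sample entries are assumptions, not the real list):

```python
# Stand-in for `psl.tlds`; the real values come from the PublicSuffixList object.
entries = ["com", "org", "co.uk", "github.io", "dev"]

# Keep single-label suffixes only and prefix each with a dot.
tlds = {f".{entry}" for entry in entries if "." not in entry}
```

Multi-label suffixes such as "co.uk" are dropped, so the resulting set holds only entries like ".com" and ".org".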

Contributor Author

Changed here: 75d1a23

def remove_urls(self, text: str) -> str:
    for keyword in self.non_file_path_keywords:
        if keyword in text:
            text = re.sub(r"\S*" + re.escape(keyword) + r"\S*", "", text)
Collaborator

Here and elsewhere, it'd be ideal to use f-strings instead of string addition
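
One way to apply this to the line above is a raw f-string, which builds the same pattern without concatenation. The sketch below uses a standalone function with a hypothetical name and an explicit `keywords` argument rather than the method and attribute from the PR.

```python
import re


def remove_tokens_containing(text: str, keywords: list[str]) -> str:
    """Delete every whitespace-delimited token that contains one of the keywords."""
    for keyword in keywords:
        if keyword in text:
            # Raw f-string: same pattern as r"\S*" + re.escape(keyword) + r"\S*".
            text = re.sub(rf"\S*{re.escape(keyword)}\S*", "", text)
    return text
```

The `rf` prefix keeps `\S` as a literal regex escape while still interpolating the escaped keyword.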

Contributor Author

Changed here: 75d1a23


def test_FilePaths(read):
    read(
        "file_paths.json",
Collaborator

Given the fixture is small enough, you could just write the text here directly and remove the fixture. This will make it easier to follow the test.

Contributor Author

Done here: 301783c

@benjaminmah
Contributor Author

Given your latest changes, was there any effect on the metrics?

Training the model with the file path feature included and excluded, I got the following results:

CT            Feature Inclusion   pre   rec   spe   f1    geo   iba   sup
Training Set  With File Path      0.95  0.95  1.00  0.95  0.97  0.95  73665
Training Set  Without File Path   0.95  0.95  1.00  0.95  0.98  0.95  73656
No CT         With File Path      0.64  0.63  0.99  0.62  0.78  0.60  8185
No CT         Without File Path   0.63  0.62  0.99  0.61  0.77  0.59  8184
60% CT        With File Path      0.46  0.33  1.00  0.38  0.44  0.32  8185
60% CT        Without File Path   0.44  0.32  1.00  0.36  0.42  0.31  8184
70% CT        With File Path      0.47  0.32  1.00  0.37  0.42  0.30  8185
70% CT        Without File Path   0.45  0.30  1.00  0.36  0.41  0.29  8184
80% CT        With File Path      0.49  0.29  1.00  0.36  0.41  0.28  8185
80% CT        Without File Path   0.47  0.28  1.00  0.34  0.39  0.27  8184
90% CT        With File Path      0.50  0.26  1.00  0.33  0.38  0.25  8185
90% CT        Without File Path   0.48  0.25  1.00  0.32  0.36  0.24  8184

Overall, there seems to be a marginal increase in precision and recall when the file path feature is included.
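
To put a number on "marginal", the per-threshold differences can be computed directly from the table. This is a small sketch that assumes the "pre" and "rec" columns are precision and recall; the values are copied from the non-training rows above.

```python
# (precision, recall) per confidence threshold, copied from the table above.
with_fp = {"No CT": (0.64, 0.63), "60% CT": (0.46, 0.33), "70% CT": (0.47, 0.32),
           "80% CT": (0.49, 0.29), "90% CT": (0.50, 0.26)}
without_fp = {"No CT": (0.63, 0.62), "60% CT": (0.44, 0.32), "70% CT": (0.45, 0.30),
              "80% CT": (0.47, 0.28), "90% CT": (0.48, 0.25)}

# Difference (with - without), rounded to the table's two-decimal precision.
deltas = {
    ct: (round(with_fp[ct][0] - without_fp[ct][0], 2),
         round(with_fp[ct][1] - without_fp[ct][1], 2))
    for ct in with_fp
}

for ct, (d_pre, d_rec) in deltas.items():
    print(f"{ct}: precision {d_pre:+.2f}, recall {d_rec:+.2f}")
```

Every non-training row shows a gain of 0.01 to 0.02 in both columns, which is at the resolution of the reported figures.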

Successfully merging this pull request may close these issues.

[model:component] Add features based on file paths in the title and description