Add features based on file paths in the title and description #4270

benjaminmah · 2024-06-20T19:55:29Z

Resolves #4269.

Introduces a new feature that extracts file paths mentioned in the title and description of a bug and splits each path into sub-paths and individual directories/files.
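To make the splitting concrete, here is a minimal sketch of how one path could be expanded into sub-paths and individual components. The function name `split_path` and the ordering of the output are assumptions for illustration, not the code in this PR.

```python
def split_path(path: str) -> list[str]:
    """Expand a path into cumulative sub-paths plus its individual components.

    Hypothetical helper: the name and output order are illustrative only.
    """
    # Ignore empty segments caused by leading, trailing, or doubled slashes.
    parts = [part for part in path.split("/") if part]
    # Cumulative prefixes, e.g. "a", "a/b", "a/b/c.cpp".
    sub_paths = ["/".join(parts[: i + 1]) for i in range(len(parts))]
    # Append the single components and de-duplicate while preserving order.
    return list(dict.fromkeys(sub_paths + parts))


tokens = split_path("netwerk/protocol/http/nsHttpChannel.cpp")
```

Each returned string can then be treated as a separate token, so a bug mentioning a deep path still shares tokens with bugs that only mention the top-level directory.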

benjaminmah · 2024-06-20T19:59:34Z

Metrics of the newly trained model: metrics.log

bugbug/bug_features.py

suhaibmujahid

Do you see significant improvement when adding this feature?

benjaminmah · 2024-07-02T14:41:46Z

Do you see significant improvement when adding this feature?

I've previously attached the metrics of the model here:

Metrics of the newly trained model: metrics.log

Here are the metrics of the original/current model: metrics_original.log

There is a slight improvement (~ +1%) in each of the metrics.

bugbug/bug_features.py

marco-c

Looks good in general, but could you add a few tests for the new class?

benjaminmah · 2024-07-18T20:33:48Z

I've converted this PR to a draft, as I realized the extraction of file paths still needs some polishing. For example, there are cases where it may mistake a URL or a numbered step (e.g. "1.Step 1", "2.Step2") for a file path. Once that is done, I'll be sure to add a few tests for this feature!
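One way to avoid both kinds of false positive is to reject anything that starts like a URL and to accept only tokens whose extension is in a known set. The sketch below assumes a tiny hand-written `VALID_EXTENSIONS` set and a hypothetical `looks_like_file_path` helper; it is not the filtering logic used in this PR.

```python
import re
from pathlib import PurePosixPath

# Hypothetical, deliberately small allow-list used only for this example.
VALID_EXTENSIONS = {".cpp", ".h", ".js", ".py", ".rs"}

# Matches tokens that begin with a URL scheme ("https://") or with "www.".
URL_PREFIX = re.compile(r"^(?:[a-z][a-z0-9+.-]*://|www\.)", re.IGNORECASE)


def looks_like_file_path(token: str) -> bool:
    """Return True if the token plausibly names a source file."""
    if URL_PREFIX.match(token):
        return False
    # "1.Step" has the suffix ".Step", which is not a known extension.
    return PurePosixPath(token).suffix.lower() in VALID_EXTENSIONS


candidates = ["dom/base/nsDocument.cpp", "https://example.com/main.js", "1.Step", "2.Step2"]
file_paths = [token for token in candidates if looks_like_file_path(token)]
```

An allow-list trades recall for precision: files with unusual extensions are dropped, so the set of extensions needs to be broad enough for the code base in question.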

benjaminmah · 2024-07-19T20:29:17Z

Current metrics: metrics.log

This version seems to perform slightly worse than the current model, and `python3 -m scripts.bug_classifier component --bug-id 1902245` classifies the bug as Core::Widget: Gtk (which is incorrect).

It is worth noting that the first version of the model with the file path feature correctly classified the above bug as Core::Networking, even though it did not retrieve all of the relevant file paths from the bug summary and description correctly. I will continue to look into this.

benjaminmah · 2024-07-22T17:49:40Z

With the current model, `python3 -m scripts.bug_classifier component --bug-id 1902245` now correctly classifies the bug as Core::Networking. The metrics can be found here: metrics.log.

benjaminmah · 2024-07-24T15:52:18Z

Looks good in general, but could you add a few tests for the new class?

Added two tests here: a5a9c0f

benjaminmah · 2024-07-29T13:21:38Z

It seems the tests failed; I'll revise them ASAP.

marco-c · 2024-08-01T14:35:13Z

What is the difference in average precision / recall? Is there any component which gets much better or much worse?

benjaminmah · 2024-08-02T17:16:25Z

What is the difference in average precision / recall? Is there any component which gets much better or much worse?

Here are the metrics from the model with the FilePaths feature included: new_model.log

Here are the metrics from the currently deployed model (which does not include the FilePaths feature): old_model.log

At the 0.9 CF, precision increased by 0.02 and recall by 0.01.

Overall, most metrics improve for specific product-component pairs. Feel free to consult the detailed metrics for the few cases where either precision or recall dropped with the new model.

bugbug/bug_features.py

marco-c

Given your latest changes, was there any effect on the metrics?

marco-c · 2024-10-18T11:56:59Z

bugbug/bug_features.py

+        )
+
+        psl = PublicSuffixList()
+        tlds = set()


Suggested change

-        tlds = set()
+        tlds = set(f".{entry}" for entry in psl.tlds if "." not in entry)

Changed here: 75d1a23
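The suggested comprehension keeps only single-label suffixes and prefixes each with a dot. The sketch below shows the effect on a hard-coded stand-in list; `suffix_entries` and `single_label_tlds` are assumptions for illustration and take the place of `psl.tlds` so that the example runs without any third-party package.

```python
def single_label_tlds(entries: list[str]) -> set[str]:
    """Keep entries without an inner dot and prefix each one with '.'."""
    return {f".{entry}" for entry in entries if "." not in entry}


# Stand-in for the entries a public suffix list would provide (assumption).
suffix_entries = ["com", "org", "co.uk", "io", "github.io"]
tlds = single_label_tlds(suffix_entries)
```

Multi-label suffixes such as "co.uk" are skipped because their last label ("uk") is expected to appear as its own entry.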

marco-c · 2024-10-18T11:57:36Z

bugbug/bug_features.py

+    def remove_urls(self, text: str) -> str:
+        for keyword in self.non_file_path_keywords:
+            if keyword in text:
+                text = re.sub(r"\S*" + re.escape(keyword) + r"\S*", "", text)


Here and elsewhere, it'd be ideal to use f-strings instead of string concatenation.

Changed here: 75d1a23
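For reference, a raw f-string (`rf"..."`) keeps the backslashes literal while still interpolating the escaped keyword. The sketch below rewrites the pattern from the excerpt as a standalone function; the function name and the explicit `keywords` parameter are assumptions for illustration.

```python
import re


def remove_tokens_containing(text: str, keywords: list[str]) -> str:
    """Delete every whitespace-delimited token that contains one of the keywords."""
    for keyword in keywords:
        if keyword in text:
            # rf-string: "\S" stays a regex escape, {…} is interpolated.
            text = re.sub(rf"\S*{re.escape(keyword)}\S*", "", text)
    return text


cleaned = remove_tokens_containing("see http://example.com/a.js for details", ["http"])
```

One thing to watch for with this style: literal braces in a regex quantifier, such as `{2,3}`, have to be doubled (`{{2,3}}`) inside an f-string.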

marco-c · 2024-10-18T12:00:02Z

tests/test_bug_features.py

+
+def test_FilePaths(read):
+    read(
+        "file_paths.json",


Given the fixture is small enough, you could just write the text here directly and remove the fixture. This will make it easier to follow the test.

Done here: 301783c

benjaminmah · 2024-10-18T18:49:36Z

Given your latest changes, was there any effect on the metrics?

Training the model with the file path feature included and excluded, I got the following results:

| CT | Feature Inclusion | pre | rec | spe | f1 | geo | iba | sup |
|---|---|---|---|---|---|---|---|---|
| Training Set | With File Path | 0.95 | 0.95 | 1.00 | 0.95 | 0.97 | 0.95 | 73665 |
| Training Set | Without File Path | 0.95 | 0.95 | 1.00 | 0.95 | 0.98 | 0.95 | 73656 |
| No CT | With File Path | 0.64 | 0.63 | 0.99 | 0.62 | 0.78 | 0.60 | 8185 |
| No CT | Without File Path | 0.63 | 0.62 | 0.99 | 0.61 | 0.77 | 0.59 | 8184 |
| 60% CT | With File Path | 0.46 | 0.33 | 1.00 | 0.38 | 0.44 | 0.32 | 8185 |
| 60% CT | Without File Path | 0.44 | 0.32 | 1.00 | 0.36 | 0.42 | 0.31 | 8184 |
| 70% CT | With File Path | 0.47 | 0.32 | 1.00 | 0.37 | 0.42 | 0.30 | 8185 |
| 70% CT | Without File Path | 0.45 | 0.30 | 1.00 | 0.36 | 0.41 | 0.29 | 8184 |
| 80% CT | With File Path | 0.49 | 0.29 | 1.00 | 0.36 | 0.41 | 0.28 | 8185 |
| 80% CT | Without File Path | 0.47 | 0.28 | 1.00 | 0.34 | 0.39 | 0.27 | 8184 |
| 90% CT | With File Path | 0.50 | 0.26 | 1.00 | 0.33 | 0.38 | 0.25 | 8185 |
| 90% CT | Without File Path | 0.48 | 0.25 | 1.00 | 0.32 | 0.36 | 0.24 | 8184 |

Overall, including the file path feature gives a marginal increase in precision and recall: 0.01 to 0.02 on every row except the training set, where precision and recall are identical.
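The per-row differences can be checked with a few lines of arithmetic. The sketch below copies the precision and recall values from the table above; the names `RESULTS` and `metric_deltas` are illustrative.

```python
# (precision, recall) with the feature, then without it, copied from the table above.
RESULTS = {
    "Training Set": ((0.95, 0.95), (0.95, 0.95)),
    "No CT": ((0.64, 0.63), (0.63, 0.62)),
    "60% CT": ((0.46, 0.33), (0.44, 0.32)),
    "70% CT": ((0.47, 0.32), (0.45, 0.30)),
    "80% CT": ((0.49, 0.29), (0.47, 0.28)),
    "90% CT": ((0.50, 0.26), (0.48, 0.25)),
}


def metric_deltas(results):
    """Return (precision delta, recall delta) per row, rounded to two decimals."""
    deltas = {}
    for row, (with_fp, without_fp) in results.items():
        deltas[row] = tuple(round(w - wo, 2) for w, wo in zip(with_fp, without_fp))
    return deltas


for row, (d_pre, d_rec) in metric_deltas(RESULTS).items():
    print(f"{row}: precision {d_pre:+.2f}, recall {d_rec:+.2f}")
```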

benjaminmah requested a review from suhaibmujahid June 20, 2024 19:59

benjaminmah marked this pull request as ready for review June 24, 2024 20:35

suhaibmujahid reviewed Jun 25, 2024

benjaminmah requested a review from suhaibmujahid June 25, 2024 19:55

suhaibmujahid reviewed Jul 2, 2024

benjaminmah requested a review from suhaibmujahid July 8, 2024 13:52

benjaminmah requested a review from marco-c July 17, 2024 18:43

marco-c reviewed Jul 18, 2024

marco-c reviewed Jul 18, 2024

marco-c requested changes Jul 18, 2024

benjaminmah marked this pull request as draft July 18, 2024 20:32

suhaibmujahid reviewed Jul 22, 2024

benjaminmah marked this pull request as ready for review July 29, 2024 14:06

benjaminmah requested a review from marco-c August 1, 2024 14:27

marco-c reviewed Aug 5, 2024

marco-c reviewed Aug 5, 2024

marco-c reviewed Aug 5, 2024

benjaminmah requested a review from suhaibmujahid September 4, 2024 19:57

benjaminmah requested a review from marco-c September 16, 2024 15:53

marco-c reviewed Oct 18, 2024

benjaminmah added 27 commits October 28, 2024 09:50

Replaced hard-coding programming language extensions with `pygment.lexers`

82a038a

Fixed tests to reflect more file extensions

bfd6334

Added publicsuffix2 to generate list of tlds

5f4ec72

Replaced all addition strings with f-strings

1f3921b

Removed fixture from file path test

0cf2482

Fixed test errors

deadc18

Added custom delimiter

a3f0ede

Fixed json input

a96a1e2

Deleted fixture for file paths

176079c

Pre-compile regex

19a289c

Removed comment

4fd1f04

Changed default value of inline_data to None

0d8d9dd

Removed inline data boolean

c28ad97

Removed readlines()

24a5375

Converted results into a list

2bcfb18

Moved FilePaths test to function

9bed4a1

Fixed indentation

e76c0d0

Fixed assertion

1c00a10

Changed valid_extensions to a local variable instead of an attribute

f2a9d39

Converted non_file_path_keywords from attribute to local variable

b677ccb

Added comment explaining sorting valid_extensions

70f72f5

Removed deletion of URLs from string

836d42d

Removed sorting (test)

d6f8002

Removed sorting comment

e11be5b

Simplified updating valid extensions set with lexers

38432cf

Fixed ValueError

9373c0b

Fixed ValueError

2022eb4

benjaminmah force-pushed the file-path-features branch from 7eaab61 to 2022eb4 Compare October 28, 2024 13:53

Removed tracking

64e5c3b

benjaminmah requested a review from suhaibmujahid November 26, 2024 00:07
