Added how-to find mode value in SArray #39
Open: guy4261 wants to merge 4 commits into turi-code:master from guy4261:master
@@ -0,0 +1,69 @@
import graphlab as gl

def mode_sa(sa, single_mode=True):
    """Return a mode of sa, or all modes if there are several.

    single_mode: whether to return a single mode or an SArray of all modes (default: True)."""

    if len(sa) == 0:
        raise ValueError("Can't find mode(s) in empty SArray")

    sf = gl.SFrame({"value": sa})
    sf2 = sf.groupby("value", {"count": gl.aggregate.COUNT()})
    max_count_index = sf2["count"].argmax()

    if single_mode:
        return sf2[max_count_index]["value"]

    else:
        max_count = sf2[max_count_index]["count"]
        return sf2[sf2["count"] == max_count]["value"]


# Create an SArray with two modes (most-common elements: 2 and 3)
sa = gl.SArray([1, 2, 2, 3, 3])

# Find one of the modes
single_mode = mode_sa(sa) # returns 2

# Find all modes
all_modes = mode_sa(sa, single_mode=False)
# Returns
# dtype: int
# Rows: 2
# [2, 3]


# A faster (albeit maybe less accurate) way to find the mode value is using sa.sketch_summary().frequent_items().
# There are two caveats to this approach:
# 1. won't work for very low-frequency mode values, and
# 2. won't necessarily give the correct result if there are multiple likely candidates.

def sketch_mode_sa(sa, single_mode=True):
    """Fast (albeit less accurate) way to find the mode value(s) of SArray sa.

    single_mode: whether to return a single mode or an SArray of all modes (default: True)."""

    if len(sa) == 0:
        raise ValueError("Can't find mode(s) in empty SArray")

    frequent_items_sketch = sa.sketch_summary().frequent_items()
    modes_sketch = [k for (k, v) in frequent_items_sketch.iteritems()
                    if v == max(frequent_items_sketch.itervalues())]
    return modes_sketch[0] if single_mode else modes_sketch

sketch_mode_sa(sa) # returns 2
sketch_mode_sa(sa, single_mode=False) # returns [2, 3]


# Both approaches should handle empty SArrays.
# The implementations above will simply raise a ValueError if `sa` is empty.
try:
    mode_sa(gl.SArray([]))
except ValueError:
    pass

try:
    sketch_mode_sa(gl.SArray([]))
except ValueError:
    pass
This may be slow if the list of frequent items is long (and there is no real upper bound on how long it can be). The list comprehension in `sketch_mode_sa` calls `max` over `itervalues()` once for each item of the dict. For performance, this should be refactored to call `max(frequent_items_sketch.itervalues())` a single time and compare each count against that stored value.
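A minimal sketch of the refactor this comment asks for, under some assumptions: the helper name `modes_from_counts` is hypothetical, a plain `dict` mapping value to count stands in for the result of `sa.sketch_summary().frequent_items()`, and the Python 3 `values()`/`items()` methods are used where the PR uses `itervalues()`/`iteritems()`. The guard for an empty dict is also an addition, not something in the PR. The point is only that the maximum is computed once, outside the comprehension, so the work is linear in the number of frequent items rather than quadratic.

```python
def modes_from_counts(counts, single_mode=True):
    """Return the key(s) with the highest count in a value -> count dict.

    `counts` is a stand-in (assumption) for the dict returned by
    sa.sketch_summary().frequent_items() in the PR.
    """
    if not counts:
        # Assumed behaviour: mirror the PR's ValueError for empty input.
        raise ValueError("Can't find mode(s) in empty counts")

    # Compute the maximum count a single time and reuse it below.
    max_count = max(counts.values())

    # One pass over the items, comparing each count to the stored maximum.
    modes = [k for k, v in counts.items() if v == max_count]
    return modes[0] if single_mode else modes


# Counts matching the PR's example SArray [1, 2, 2, 3, 3].
example_counts = {1: 1, 2: 2, 3: 2}
all_modes = modes_from_counts(example_counts, single_mode=False)
one_mode = modes_from_counts(example_counts)
```

Inside `sketch_mode_sa`, the same change would amount to assigning `max(frequent_items_sketch.itervalues())` to a local variable before the list comprehension and comparing `v` to that variable.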