Skip to content
This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

sort symbols in order of frequency rather than lexicographically #280

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

cstyan
Copy link
Contributor

@cstyan cstyan commented Feb 9, 2018

for #249

as we keep track of symbols in head we now also keep track of how many times we've seen that symbol, and when we write the symbols in IndexWriter we sort the symbols in order of frequency seen before writing them

@fabxc
Copy link
Contributor

fabxc commented Feb 9, 2018

Thanks for working on this! Currently we appear to have some severe bugs in Prometheus 2.1 tied to the storage. Thus, I'd suggest we freeze any new features to prometheus/tsdb until things go back to being stable.

So this PR will probably be on hold for a bit.

@cstyan
Copy link
Contributor Author

cstyan commented Feb 9, 2018

@fabxc no problem! I can pick this (#249 ) back up when we're in a more stable state. Is there anything I can help with regarding the bugs in 2.1?

@cstyan
Copy link
Contributor Author

cstyan commented Feb 23, 2018

@fabxc will the 2.2.0 release unblock this?

@cstyan cstyan force-pushed the callum-symbols-sort branch from 598c024 to 24b0863 Compare March 8, 2018 20:21
@cstyan cstyan force-pushed the callum-symbols-sort branch from 24b0863 to 508d576 Compare March 17, 2018 20:37
@cstyan
Copy link
Contributor Author

cstyan commented Mar 17, 2018

rebased off master, fixed a merge conflict that I'd missed when I pushed last

Copy link
Collaborator

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'll run it locally for a while and report back.

head.go Outdated

for s := range h.head.symbols {
res[s] = struct{}{}
res[s] = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this 0? It should be:

for s, num := range h.head.symbols {
    res[s] = num
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I had a reason, I don't remember at this point :)

@cstyan cstyan force-pushed the callum-symbols-sort branch from 508d576 to 95666e0 Compare April 1, 2018 20:19
@cstyan
Copy link
Contributor Author

cstyan commented Apr 1, 2018

changed the value assigned to the symbols in Reader in head.go and squashed it into the original commit

let me know if there's anything I can do to help test this :)

@cstyan cstyan force-pushed the callum-symbols-sort branch from 95666e0 to 00a6d1c Compare May 23, 2018 17:10
@cstyan
Copy link
Contributor Author

cstyan commented May 23, 2018

@gouthamve @fabxc are these changes still relevant?

@krasi-georgiev
Copy link
Contributor

Shouldn't you sort the symbols in the index writer , before writing it to disk?
IIUC this is where we want to save some space.

https://github.com/prometheus/tsdb/blob/c848349f07c83bd38d5d19faa5ea71c7fd8923ea/index/index.go#L343
instead of this , sort by popularity.

@cstyan
Copy link
Contributor Author

cstyan commented Jun 12, 2018

@krasi-georgiev I'll have to double check, haven't looked at this in a while

@cstyan cstyan force-pushed the callum-symbols-sort branch from 00a6d1c to 5088a2c Compare September 27, 2018 22:58
@cstyan
Copy link
Contributor Author

cstyan commented Sep 27, 2018

@krasi-georgiev is that not what is happening here: https://github.com/prometheus/tsdb/pull/280/files#diff-71ebe2bcf31a915b1fa3b3b289d5d31dR354 ?

rebased off master to fix the conflict in head.go

@cstyan cstyan force-pushed the callum-symbols-sort branch from 5088a2c to 36cbad4 Compare September 27, 2018 23:02
@krasi-georgiev
Copy link
Contributor

failing tests

@cstyan cstyan force-pushed the callum-symbols-sort branch from 36cbad4 to 21bde8c Compare October 3, 2018 21:16
Copy link
Contributor

@krasi-georgiev krasi-georgiev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test how much are the savings by this change?

Also I don't see any test to ensure the behaviour that Symbols are saved ordered by frequency.
something like

  • Add some symbols
  • Save index
  • Read index
  • Check Symbols order.

index/index.go Outdated Show resolved Hide resolved
@krasi-georgiev
Copy link
Contributor

@cstyan cstyan force-pushed the callum-symbols-sort branch from 21bde8c to 52dadbc Compare January 5, 2019 01:05
@cstyan
Copy link
Contributor Author

cstyan commented Jan 5, 2019

Added a test for the sorting of symbols

I'm not sure if we want to get back the frequency #'s when we read the symbols back out of the table?

@brian-brazil
Copy link
Contributor

I don't see a reason to expose that in the API.

@krasi-georgiev
Copy link
Contributor

failing tests

also

Can you test how much are the savings by this change?

@cstyan cstyan force-pushed the callum-symbols-sort branch from 1664379 to 7e9131d Compare January 7, 2019 02:34
@cstyan
Copy link
Contributor Author

cstyan commented Jan 7, 2019

@krasi-georgiev

Can you test how much are the savings by this change?

Yes I'll have a look at, I guess including a benchmark test? But if you read #249 the goal is to reduce the size of the index file.

@brian-brazil

I don't see a reason to expose that in the API.

I'll have to double check. When I was reading the use of the index and block reader to get the symbols, when compaction happens we read the current symbols, determine which are still in use, and then write those to a new index. In that case we would want the frequencies to persist across that write. But I've probably just misread what's happening.

@krasi-georgiev
Copy link
Contributor

Yes I'll have a look at, I guess including a benchmark test? But if you read #249 the goal is to reduce the size of the index file.

yes this is what I meant , how much is the index file size reduced by this change.

@krasi-georgiev
Copy link
Contributor

@cstyan would you have time to continue with this?
Otherwise I can try to find some time to continue from here.

@cstyan
Copy link
Contributor Author

cstyan commented Jan 25, 2019

@krasi-georgiev yeah I should have some time next week, if you wanted to try something before then feel free.

Krasi Georgiev added 2 commits February 4, 2019 11:39
@krasi-georgiev
Copy link
Contributor

updated to the latest master and resolved the conflicts.

Now will run some tests locally to compare the index file savings with this change.

@krasi-georgiev
Copy link
Contributor

krasi-georgiev commented Feb 4, 2019

using the following test I don't see any difference in the index file size.

The index file size with or without the changes in this PR is 180Mb.

func TestIndexSize(t *testing.T) {
	tmpdir, err := ioutil.TempDir("", "testIndex")
	testutil.Ok(t, err)
	var dirs []string

	for i := int64(0); i < 5; i++ {
		dirs = append(dirs, createBlock(t, tmpdir, genSeries(1000, 10, i*1000, i*1000+1000)))

	}
	compactor, err := NewLeveledCompactor(nil, log.NewNopLogger(), []int64{1000000}, nil)
	testutil.Ok(t, err)

	uid, err := compactor.Compact(tmpdir, dirs, nil)
	testutil.Ok(t, err)
	fmt.Println(tmpdir + "/" + uid.String())
}

This generates random series so it should generate enough churn.

@krasi-georgiev
Copy link
Contributor

ping @cstyan

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants