OGC-508: Replace Elasticsearch by Postgres v3 #1559
base: master
Conversation
…o hybrid properties)
…c documents in python rather than psql
src/onegov/org/models/search.py
Outdated
func.setweight(
    func.to_tsvector(
        language,
        getattr(model.fts_idx_data, field, '')),
The weighted vector is based on the static data from the fts_idx_data column, generated upon update or reindex events. With this approach no additional …
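For reference, a hypothetical raw-SQL equivalent of the weighted-vector expression above. The field name `title`, the weight label `A`, the language `german`, and the `->>` JSON access are illustrative assumptions, not taken from the diff:

```sql
-- Sketch only: roughly what the SQLAlchemy expression compiles to.
-- 'title' stands in for one indexed field; 'A' is the highest of the
-- four Postgres weight labels (A > B > C > D).
SELECT setweight(
           to_tsvector('german', coalesce(fts_idx_data->>'title', '')),
           'A'
       ) AS weighted_vector
FROM some_model;
```

Because the vector is built from the static `fts_idx_data` snapshot, the weight assignment only changes when the row is reindexed, not on every query.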
@Daverball Final review for postgres searching on separate views
Force-pushed from a95eef6 to 0eef94f (Compare)
It looks fairly close to a first version we can deploy; however, there are some engineering decisions that don't make sense to me and harm performance significantly, so I would like you to revisit those problem areas.
else:
    results = self.generic_search()

return results[self.offset:self.offset + self.batch_size]
It's not ideal that we always retrieve all the results and then filter them. But I realize it may be difficult to do all the filtering and sorting in pure Postgres, and we'd still have to retrieve a full count of all the entries, so we're not saving as much in query time as we would in object translation overhead. The latter, however, may be significantly larger than the former for large result sets.
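One way to avoid materializing the full result set in Python is to let Postgres return both the page and the total count in a single query, via a window function. This is a sketch under assumed names (`some_model`, `fts_idx`, language `german`); `websearch_to_tsquery` requires Postgres 11+:

```sql
-- Sketch: fetch one page plus the total hit count in one round trip,
-- so Python never has to load and slice every matching row.
SELECT id,
       ts_rank_cd(fts_idx, query) AS rank,
       count(*) OVER ()           AS total_hits  -- window count over all matches
FROM   some_model,
       websearch_to_tsquery('german', 'test') AS query
WHERE  fts_idx @@ query
ORDER  BY rank DESC
LIMIT  10 OFFSET 0;
```

Every returned row carries the same `total_hits`, so the pagination count comes for free with the first page.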
Fix typo Co-authored-by: David Salvisberg <[email protected]>
Remove unnecessary `all()` call Co-authored-by: David Salvisberg <[email protected]>
This is what I was hinting at, you need to do this in two steps in separate locations, you can't do it in one function.
Filter polymorphic query by polymorphic identity for Searchable models Co-authored-by: David Salvisberg <[email protected]>
Rework base model filter Co-authored-by: David Salvisberg <[email protected]>
@Daverball Could you please check my latest changes?
The indexer looks a lot better now. But I think we still have fundamental problems with the actual search. It's time to consider alternative architectures, such as a single shared table for the search metadata, so we can do all the counting, filtering and slicing of the entries in the database in a single query, and only have to go out and fetch the models we're actually displaying results for on the current page.
This should mean we now only have one potentially expensive fts query, with the rest turning into simple "Fetch these primary keys from table X and these others from table Y" queries that should be very fast.
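A minimal sketch of what that shared search table could look like. All names here are illustrative, not from the PR:

```sql
-- Sketch of a shared search-metadata table; every searchable model
-- writes one row here at index time.
CREATE TABLE search_index (
    id          bigserial   PRIMARY KEY,
    owner_type  text        NOT NULL,  -- e.g. the polymorphic identity / table name
    owner_id    text        NOT NULL,  -- primary key of the indexed model
    fts_idx     tsvector    NOT NULL,  -- precomputed weighted vector
    last_change timestamptz,
    UNIQUE (owner_type, owner_id)
);

CREATE INDEX search_index_fts ON search_index USING gin (fts_idx);
```

With this layout the expensive fts query runs once against `search_index`, and the follow-up per-table queries reduce to primary-key lookups, which the planner serves from the tables' existing indexes.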
@cached_property
def available_documents(self) -> int:
    if not self.number_of_docs:
        _ = self.load_batch_results
    return self.number_of_docs

@cached_property
def available_results(self) -> int:
    if not self.number_of_results:
        _ = self.load_batch_results
    return self.number_of_results
I still don't know what this means; this should be one number, and it should be the same regardless. If there's a difference, there's a bug.
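Collapsing the two properties into a single cached count would remove the ambiguity. A minimal sketch (the `SearchResults` class and its `_hits` attribute are hypothetical stand-ins, not names from the PR):

```python
from functools import cached_property


class SearchResults:
    """Hypothetical sketch: one count, computed once, served from cache."""

    def __init__(self, hits: list) -> None:
        self._hits = hits

    @cached_property
    def available_results(self) -> int:
        # Single source of truth: "documents" and "results" are the
        # same number, so only one property needs to exist.
        return len(self._hits)


results = SearchResults(['a', 'b', 'c'])
print(results.available_results)  # -> 3
```

`cached_property` also removes the `if not self.number_of_results:` dance, since the first access computes and stores the value.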
decay_rank = (
    func.ts_rank_cd(model.fts_idx, ts_query, 0) *
    cast(func.pow(0.9,
        func.extract('epoch',
            func.now() - func.coalesce(
                cast(
                    model.fts_idx_data[
                        'es_last_change'].astext,
                    DateTime),
                func.now())
        ) / 86400),
        Numeric)
).label('rank')
We may want to add an index for this expression (although I'm not sure how well database indexes hold up for time-based expressions), since I suspect that this is quite slow to compute, especially since it relies on JSON data and string-to-date conversions. The other thing you can do to speed this up would be to store the epoch, rather than a string timestamp, in es_last_change, so you don't need any data type conversions.
The other thing I'm not sure about is whether this will always do the right thing: since we store tz-naive dates in the database, NOW() will return a value in the database's default timezone, which could be configured to something other than UTC, which makes this a bit fragile.
It will work for us right now, but I'd prefer something more portable and robust.
Maybe it would be even better to calculate this rank during indexing and then have a daily/weekly/monthly cronjob to re-calculate the search rank. I think this would be more than precise enough, since new entries and recently changed entries will all bunch up with the same high search rank.
This way we can also encode things like the custom event sorting into this rank and have to do less work after we get our results.
It honestly might be best to define a new table for searching at this point that contains all the search metadata and a reference pointing to the original model. That way we can perform a search using a singular query and can do the counting, sorting and slicing of that query entirely on the database. Having to do this in Python will slow down things by a lot for large instances with search terms that return many results.
self.search_models = {
    model for base in self.request.app.session_manager.bases
    for model in searchable_sqlalchemy_models(base)}
This is still not really correct for performing as few non-overlapping queries as possible. But if we go with a separate table for the search index, this should simplify away to some degree.
Search: Adds postgres search including views
/search-postgres?q=test
TYPE: Feature
LINK: ogc-508