Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ogc 508 replace elastic search by postgres v3 #1559

Open
wants to merge 44 commits into
base: master
Choose a base branch
from

Conversation

Tschuppi81
Copy link
Contributor

@Tschuppi81 Tschuppi81 commented Oct 25, 2024

Search: Adds postgres search including views /search-postgres?q=test

TYPE: Feature
LINK: ogc-508

Checklist

  • I have performed a self-review of my code
  • I considered adding a reviewer
  • I made changes/features for both org and town6, agency, fsi, translator, winterthur, feriennet
  • I have tested my code thoroughly by hand
  • I have added tests for my changes/features

Copy link

linear bot commented Oct 25, 2024

Copy link

codecov bot commented Oct 25, 2024

Codecov Report

Attention: Patch coverage is 85.60606% with 38 lines in your changes missing coverage. Please review.

Project coverage is 88.70%. Comparing base (10c9a81) to head (95abc04).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/onegov/org/models/search.py 80.68% 28 Missing ⚠️
src/onegov/org/views/search.py 68.18% 7 Missing ⚠️
src/onegov/fsi/views/search.py 92.85% 1 Missing ⚠️
src/onegov/search/cli.py 0.00% 1 Missing ⚠️
src/onegov/search/integration.py 91.66% 1 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
src/onegov/agency/views/search.py 100.00% <100.00%> (ø)
src/onegov/directory/models/directory_entry.py 95.31% <ø> (ø)
src/onegov/landsgemeinde/views/search.py 100.00% <100.00%> (ø)
src/onegov/onboarding/app.py 100.00% <100.00%> (ø)
src/onegov/onboarding/models/town_assistant.py 93.54% <100.00%> (ø)
src/onegov/org/app.py 97.81% <100.00%> (ø)
src/onegov/org/cronjobs.py 92.83% <ø> (ø)
src/onegov/org/layout.py 91.46% <100.00%> (+0.03%) ⬆️
src/onegov/org/models/__init__.py 100.00% <100.00%> (ø)
src/onegov/org/models/ticket.py 88.67% <ø> (ø)
... and 12 more

... and 5 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10c9a81...95abc04. Read the comment docs.

func.setweight(
func.to_tsvector(
language,
getattr(model.fts_idx_data, field, '')),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the weighted vector bases on the static data from column fts_idx_data generated upon update or reindex events.

@Tschuppi81
Copy link
Contributor Author

With this approach no additional hybrid_properties are needed.

@Tschuppi81 Tschuppi81 marked this pull request as ready for review October 25, 2024 19:40
@Tschuppi81 Tschuppi81 requested a review from Daverball October 25, 2024 19:40
@Tschuppi81
Copy link
Contributor Author

@Daverball Final review for postgres searching on separate views /search-postgres?q=test (not yet productive)

@Tschuppi81 Tschuppi81 force-pushed the ogc-508-replace-elastic-search-by-postgres-v3 branch from a95eef6 to 0eef94f Compare November 7, 2024 14:45
Copy link
Member

@Daverball Daverball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks fairly close to a first version we can deploy, there are however some engineering decisions that don't make sense to me and harm performance significantly, so I would like you to revisit those problem areas.

src/onegov/org/models/search.py Outdated Show resolved Hide resolved
src/onegov/org/models/search.py Show resolved Hide resolved
src/onegov/org/models/search.py Outdated Show resolved Hide resolved
else:
results = self.generic_search()

return results[self.offset:self.offset + self.batch_size]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not ideal that we always retrieve all the results and then filter them. But I realize it may be difficult to do all the filtering and sorting in pure postgres and we'd still have to retrieve a full count of all the entries, so we're not saving so much in query time as we would in object translation overhead. But the latter may be significantly larger than the former for large result sets.

src/onegov/org/models/search.py Outdated Show resolved Hide resolved
src/onegov/org/models/search.py Outdated Show resolved Hide resolved
src/onegov/org/models/search.py Outdated Show resolved Hide resolved
src/onegov/org/models/search.py Outdated Show resolved Hide resolved
src/onegov/search/indexer.py Outdated Show resolved Hide resolved
src/onegov/search/integration.py Outdated Show resolved Hide resolved
Copy link
Member

@Daverball Daverball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I was hinting at, you need to do this in two steps in separate locations, you can't do it in one function.

src/onegov/search/integration.py Outdated Show resolved Hide resolved
src/onegov/search/utils.py Outdated Show resolved Hide resolved
@Tschuppi81
Copy link
Contributor Author

@Daverball Could you please check my latest changes?

Copy link
Member

@Daverball Daverball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indexer looks a lot better now. But I think we still have fundamental problems with the actual search. I think it's time to consider alternative architectures, such as a single shared table for the search metadata, so we can do all the counting/filtering/slicing of the entries in the database on a single query. So we only actually have to go out and fetch the models we're actually displaying results for on the current page.

This should mean we now only have one potentially expensive fts query, with the rest turning into simple "Fetch these primary keys from table X and these others from table Y" queries that should be very fast.

Comment on lines +199 to +209
@cached_property
def available_documents(self) -> int:
if not self.number_of_docs:
_ = self.load_batch_results
return self.number_of_docs

@cached_property
def available_results(self) -> int:
if not self.number_of_results:
_ = self.load_batch_results
return self.number_of_results
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't know what this means, this should be one number and it should be the same regardless. If there's a difference, there's a bug.

Comment on lines +306 to +318
decay_rank = (
func.ts_rank_cd(model.fts_idx, ts_query, 0) *
cast(func.pow(0.9,
func.extract('epoch',
func.now() - func.coalesce(
cast(
model.fts_idx_data[
'es_last_change'].astext,
DateTime),
func.now())
) / 86400),
Numeric)
).label('rank')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to add an index for this expression (although I'm not sure how well database indexes hold up for time-based expressions) since I suspect, that this is quite slow to compute, especially since it relies on JSON data and string to date conversions. The other thing you can do to speed this up would be to store the epoch, rather than a string timestamp in es_last_change, so you don't need any data type conversions.

The other thing I'm not sure about is whether this will do the right thing always, since we store tz-naive dates into the database, NOW() will return in the database's default timezone, which could be configured to something other than UTC, which makes this a bit fragile.

It will work for us right now, but I'd prefer something more portable and robust.

Maybe it would be even better to calculate this rank during indexing and then have a daily/weekly/monthly cronjob to re-calculate the search rank. I think this would be more than precise enough, since new entries and recently changed entries will all bunch up with the same high search rank.

This way we can also encode things like the custom event sorting into this rank and have to do less work after we get our results.

It honestly might be best to define a new table for searching at this point that contains all the search metadata and a reference pointing to the original model. That way we can perform a search using a singular query and can do the counting, sorting and slicing of that query entirely on the database. Having to do this in Python will slow down things by a lot for large instances with search terms that return many results.

Comment on lines +195 to +197
self.search_models = {
model for base in self.request.app.session_manager.bases
for model in searchable_sqlalchemy_models(base)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not really correct for performing as few non-overlapping queries as possible. But if we go with a separate table for the search index, this should simplify away to some degree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants