New site index #234

akelyasir · 2024-11-05T22:50:48Z

Hi,

I have a few questions about your project. I haven't fully explored the project yet and I don't have much experience with search engines. Sorry if my question is stupid.

1- Does the first index need a specific source? For example, can I choose the source for the first index myself instead of using the sample.warc.gz file?
2- Does the index grow over time on its own, or is it possible to intervene? That is, can we index a site manually?
3- One of the reasons I ask is that the number of languages you currently support in searches is low. I would like to create or start a search index for a language that is not on the list.

I hope I was able to explain it clearly. Thank you in advance for your understanding.

mikkeldenker · 2024-11-06T08:27:42Z

Maintainer

Hi,

That's not a stupid question at all! The crawler in Stract is a batched crawler that only runs when you explicitly start it, so the index won't grow unless you manually run the crawler and then build the index over the new data. Technically there is also a live index that continuously polls some news sites and blogs, but you probably don't need it. I've tried to write a rough guideline for how to get started here.

The crawler doesn't support limiting to specific sites, but all parts of the index can be built from .warc files, so in theory any crawler that stores its results in this format can be used to build the index. I haven't used it myself, but you might want to look into Apache Nutch, which I think can be limited to specific sites. Alternatively, I'm pretty sure you can export Common Crawl data based on the language/TLD you want.
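
To make the .warc route a bit more concrete, here is a rough Python sketch of how an existing .warc.gz file could be narrowed down to the hosts or TLDs you care about before building an index from it. This is not part of Stract: the function names (`read_warc_records`, `host_matches`, `filter_warc`) are made up for illustration, and the parser is deliberately simplistic (standard library only, no multi-line headers, no validation). For real data a dedicated WARC library is probably the safer choice.

```python
import gzip
from urllib.parse import urlsplit


def read_warc_records(stream):
    """Yield (headers, raw_bytes) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if line in (b"\r\n", b"\n"):
            continue  # blank lines that terminate the previous record
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected the start of a record, got {line!r}")
        raw = [line]
        headers = {}
        while True:
            line = stream.readline()
            raw.append(line)
            if line in (b"\r\n", b"\n", b""):
                break  # blank line: end of the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the size of the record block in bytes.
        body = stream.read(int(headers.get("content-length", "0")))
        yield headers, b"".join(raw) + body + b"\r\n\r\n"


def host_matches(url, suffixes):
    """True if the URL's host equals a suffix or ends with '.' + suffix."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in suffixes)


def filter_warc(src, dst, suffixes):
    """Copy records whose WARC-Target-URI matches `suffixes` from src to dst.

    `src` is a path or binary file object holding gzip-compressed WARC data;
    `dst` is a binary file object opened for writing. Records without a
    target URI (for example warcinfo records) are dropped in this sketch.
    Returns the number of records kept.
    """
    kept = 0
    with gzip.open(src, "rb") as reader:
        for headers, raw in read_warc_records(reader):
            uri = headers.get("warc-target-uri", "").strip("<>")
            if host_matches(uri, suffixes):
                # One gzip member per record, as is conventional for .warc.gz.
                dst.write(gzip.compress(raw))
                kept += 1
    return kept
```

For example, something like `filter_warc("crawl.warc.gz", open("tr-only.warc.gz", "wb"), ["tr"])` would keep only records from `.tr` hosts (both file names are placeholders). Matching on the TLD is only a crude proxy for language, so a language detector on the page content may be needed for languages that are not tied to one TLD.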
