Replies: 1 comment
-
Hi, that's not a stupid question at all!

The crawler in stract is a batched crawler that only runs when you explicitly start it, so the index won't grow unless you manually run the crawler and build the index over the new data. Technically there is also a live index that continuously polls some news sites and blogs, but you probably don't need this. I've tried to write a rough guideline for how to get started here.

The crawler doesn't support limiting to specific sites, but all parts of the index can be built from .warc files, so in theory any crawler that stores its results in this format can be used to build the index. I haven't used it myself, but you might want to look into Apache Nutch, which I think can be limited to specific sites. Alternatively, I'm pretty sure you can export Common Crawl data based on the language/TLD you want.
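To make the .warc point a bit more concrete, here is a minimal sketch of how a custom crawler could serialize fetched pages as gzipped WARC/1.0 `response` records using only the Python standard library. The function names `build_warc_response` and `write_warc_gz` are hypothetical, and whether stract accepts files produced this way has not been verified; a maintained WARC library is probably the safer choice for real use.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_response(url: str, http_response: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record (hypothetical helper).

    `http_response` is the raw HTTP response as received from the server:
    status line, headers, a blank line, then the body.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # Header block ends with an empty line; the record ends with two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_response + b"\r\n\r\n"


def write_warc_gz(path: str, pages) -> None:
    """Write (url, raw_http_response) pairs to `path`, one gzip member per record."""
    with open(path, "wb") as f:
        for url, http_response in pages:
            f.write(gzip.compress(build_warc_response(url, http_response)))
```

Calling `write_warc_gz("my_site.warc.gz", pages)` with pages fetched only from the sites you care about would give you a file restricted to those sites. Compressing each record as its own gzip member is the usual convention for `.warc.gz` files, since it lets readers seek to individual records.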
-
Hi,
I have a few questions about your project. I haven't fully explored the project yet and I don't have much experience with search engines. Sorry if my question is stupid.
1- Is there a specific reference for the first index? For example, can I specify the first index reference myself without using the sample.warc.gz file?
2- Does the index grow over time? Or is it possible to intervene? That is, can we index the site manually?
3- One of the reasons I ask these questions is that the number of languages you currently support in searches is low. I would like to create or start a search index for a language that is not on the list.
I hope I was able to explain it clearly. Thank you in advance for your understanding.