I am trying to run SmartScraperGraph() using Ollama with the llama3.2 model, but I am getting the warning "Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024)", and the whole website is not being scraped.
#853 · Open · GODCREATOR333 opened this issue on Dec 27, 2024 · 2 comments
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me all the news from the website along with headlines",
    source="https://www.bbc.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
LangChainDeprecationWarning: Importing get_openai_callback from langchain.callbacks is deprecated. Please replace:
from langchain.callbacks import get_openai_callback
with:
from langchain_community.callbacks.manager import get_openai_callback
You can use the langchain cli to automatically upgrade many imports. Please see the documentation here: https://python.langchain.com/docs/versions/v0_2/
Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024). Running this sequence through the model will result in indexing errors
{
"headlines": [
"Life is not easy - Haaland penalty miss sums up Man City crisis",
"How a 1990s Swan Lake changed dance forever"
],
"articles": [
{
"title": "BBC News",
"url": "https://www.bbc.com/news/world-europe-63711133"
},
{
"title": "Matthew Bourne on his male Swan Lake - the show that shook up the dance world",
"url": "https://www.bbc.com/culture/article/20241126-matthew-bourne-on-his-male-swan-lake-the-show-that-shook-up-the-dance-world-forever"
}
]
}
Even after specifying model_tokens = 4096, it does not affect the model's maximum sequence length (1024). How can I increase it? Alternatively, how can I chunk the website into pieces of max_sequence_length so that the whole website gets scraped?
PS: Also, having the option to further crawl the links and scrape the subsequent pages would be great. Thanks.
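The 1024 limit in the warning appears to come from the helper tokenizer used internally for token counting (a GPT-2-style tokenizer with a 1024-token maximum), not from llama3.2 itself, so any chunking has to happen before the page text reaches the model. As a rough illustration of the kind of chunking being asked for, here is a minimal sketch; `chunk_text`, the ~1-token-per-word estimate, and the overlap value are all illustrative assumptions, since a real tokenizer counts tokens differently:

```python
def chunk_text(text: str, max_tokens: int = 1024, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly max_tokens words.

    Crude assumption: ~1 token per word. The tokenizer emitting the
    1024-token warning would count differently, so leave headroom.
    """
    words = text.split()
    step = max_tokens - overlap  # advance so consecutive chunks overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

# Example: 3000 words become 4 chunks of at most 1024 words each
pieces = chunk_text("word " * 3000)
print(len(pieces), max(len(p.split()) for p in pieces))  # → 4 1024
```

Each chunk could then be scraped separately and the per-chunk answers merged, which is essentially what the graph's own chunking is supposed to do when the page exceeds model_tokens.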
I am having this same issue as well. I don't think switching to OpenAI is a good resolution: in my experience OpenAI may provide better results, given it is a proprietary model, but it would be good to get this working with open-source local Llama models. I would greatly appreciate your help on this as well.
import json
from scrapegraphai.graphs import SmartScraperGraph
from ollama import Client

ollama_client = Client(host='http://localhost:11434')

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0.0,
        "format": "json",
        "model_tokens": 4096,
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "nomic-embed-text",
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me all the news from the website along with headlines",
    source="https://www.bbc.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
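On the context-window side of the question: Ollama's default context window is limited, and it can be raised per request via the `num_ctx` option of Ollama's generate API, independently of the tokenizer warning above. Whether scrapegraphai forwards such options is not shown in this thread, so the sketch below is only a hypothetical request payload for Ollama's `/api/generate` endpoint, with 8192 as an illustrative value:

```python
# Hypothetical request body for Ollama's /api/generate endpoint.
# "num_ctx" raises the context window for this request; 8192 is an
# illustrative value, not something taken from the issue above.
payload = {
    "model": "llama3.2",
    "prompt": "Extract the news headlines from the following page text: ...",
    "options": {"num_ctx": 8192},
}
print(payload["options"]["num_ctx"])  # → 8192
```

The same options dict can be passed through the ollama Python client (e.g. `ollama_client.generate(model=..., prompt=..., options=...)`), but raising num_ctx only helps if the scraping pipeline actually sends the larger context to the model.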
Ubuntu 22.04 LTS
GPU : RTX 4070 12GB VRAM
RAM : 16GB DDR5
Ollama/Llama3.2:3B model