I am trying to run SmartScraperGraph() using Ollama with the llama3.2 model, but I am getting the warning "Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024)", and the whole website is not being scraped.
#853 · Open · GODCREATOR333 opened this issue on Dec 27, 2024 · 2 comments
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me all the news from the website along with headlines",
    source="https://www.bbc.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
LangChainDeprecationWarning: Importing get_openai_callback from langchain.callbacks is deprecated. Please replace:
from langchain.callbacks import get_openai_callback
with:
from langchain_community.callbacks.manager import get_openai_callback
You can use the langchain cli to automatically upgrade many imports. Please see the documentation here: https://python.langchain.com/docs/versions/v0_2/
Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024). Running this sequence through the model will result in indexing errors
{
"headlines": [
"Life is not easy - Haaland penalty miss sums up Man City crisis",
"How a 1990s Swan Lake changed dance forever"
],
"articles": [
{
"title": "BBC News",
"url": "https://www.bbc.com/news/world-europe-63711133"
},
{
"title": "Matthew Bourne on his male Swan Lake - the show that shook up the dance world",
"url": "https://www.bbc.com/culture/article/20241126-matthew-bourne-on-his-male-swan-lake-the-show-that-shook-up-the-dance-world-forever"
}
]
}
Even after specifying model_tokens = 4096, it does not affect the model's maximum sequence length (1024). How can I increase it? Alternatively, how can I chunk the website into pieces of max_sequence_length so that the whole website gets scraped?
PS: Also, having the option to further crawl the links and scrape the subsequent pages would be great. Thanks.
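The 1024 limit in the warning appears to come from the helper tokenizer used internally for token counting (a GPT-2-style tokenizer with a 1024-token maximum), not from llama3.2 itself, so any chunking has to happen before the page text reaches the model. As a rough illustration of the kind of chunking being asked for, here is a minimal sketch; `chunk_text`, the ~1-token-per-word estimate, and the overlap value are all illustrative assumptions, since a real tokenizer counts tokens differently:

```python
def chunk_text(text: str, max_tokens: int = 1024, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly max_tokens words.

    Crude assumption: ~1 token per word. The tokenizer emitting the
    1024-token warning would count differently, so leave headroom.
    """
    words = text.split()
    step = max_tokens - overlap  # advance so consecutive chunks overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

# Example: 3000 words become 4 chunks of at most 1024 words each
pieces = chunk_text("word " * 3000)
print(len(pieces), max(len(p.split()) for p in pieces))  # → 4 1024
```

Each chunk could then be scraped separately and the per-chunk answers merged, which is essentially what the graph's own chunking is supposed to do when the page exceeds model_tokens.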
I am having this same issue as well. I don't think switching to OpenAI is a good resolution: in my experience OpenAI may provide better results, given it is a proprietary model, but it would be good to get this working with open-source local Llama models. I would greatly appreciate your help on this as well.
import json
from scrapegraphai.graphs import SmartScraperGraph
from ollama import Client

ollama_client = Client(host='http://localhost:11434')

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0.0,
        "format": "json",
        "model_tokens": 4096,
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "nomic-embed-text",
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me all the news from the website along with headlines",
    source="https://www.bbc.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
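On the context-window side of the question: Ollama's default context window is limited, and it can be raised per request via the `num_ctx` option of Ollama's generate API, independently of the tokenizer warning above. Whether scrapegraphai forwards such options is not shown in this thread, so the sketch below is only a hypothetical request payload for Ollama's `/api/generate` endpoint, with 8192 as an illustrative value:

```python
# Hypothetical request body for Ollama's /api/generate endpoint.
# "num_ctx" raises the context window for this request; 8192 is an
# illustrative value, not something taken from the issue above.
payload = {
    "model": "llama3.2",
    "prompt": "Extract the news headlines from the following page text: ...",
    "options": {"num_ctx": 8192},
}
print(payload["options"]["num_ctx"])  # → 8192
```

The same options dict can be passed through the ollama Python client (e.g. `ollama_client.generate(model=..., prompt=..., options=...)`), but raising num_ctx only helps if the scraping pipeline actually sends the larger context to the model.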
Ubuntu 22.04 LTS
GPU : RTX 4070 12GB VRAM
RAM : 16GB DDR5
Ollama/Llama3.2:3B model