A Playwright-based web crawler component for Fess that adds the ability to crawl JavaScript-rendered web pages. This extension integrates Microsoft Playwright with the Fess crawler framework to handle modern web applications that require JavaScript execution to render their content.
- Multi-browser Support: Compatible with Chromium, Firefox, and WebKit browsers
- JavaScript Rendering: Full support for SPAs and JavaScript-heavy websites
- Authentication Integration: Seamless integration with Fess's authentication system
- Proxy Configuration: Built-in proxy support with bypass patterns
- SSL Flexibility: Option to ignore SSL certificate validation for testing
- File Downloads: Handles various content types including PDF, images, documents
- Resource Management: Efficient browser context sharing and cleanup
- Configurable Rendering States: Control when to extract content (load, DOMContentLoaded, networkidle)
- Java: 21+
- Maven: 3.x
- Microsoft Playwright: Browser automation engine (latest version)
- Fess Crawler: Core crawler framework
- Apache Commons Pool: Connection pooling
- OpenSearch: Search engine integration (provided scope)
- JUnit + UTFlute: Testing framework
- Java 21 or higher
- Maven 3.x
- Node.js and npm (for Playwright browser installation)
- Fess parent POM dependency
First, install the required parent POM dependency:
git clone https://github.com/codelibs/fess-parent.git
cd fess-parent
mvn install -Dgpg.skip=true
Install the required browser binaries:
npx playwright install --with-deps
Clone and build the project:
git clone https://github.com/codelibs/fess-crawler-playwright.git
cd fess-crawler-playwright
mvn clean package
import org.codelibs.fess.crawler.client.http.PlaywrightClient;
import org.codelibs.fess.crawler.client.http.PlaywrightClientCreator;
// Create and configure the client
PlaywrightClient client = new PlaywrightClient();
client.setBrowserName("chromium"); // or "firefox", "webkit"
client.init();
// Use with Fess crawler
RequestData requestData = RequestDataBuilder.newRequestData()
.get()
.url("https://example.com")
.build();
ResponseData responseData = client.execute(requestData);
Add the following to your pom.xml:
<dependency>
<groupId>org.codelibs.fess</groupId>
<artifactId>fess-crawler-playwright</artifactId>
<version>15.2.0-SNAPSHOT</version>
</dependency>
PlaywrightClient client = new PlaywrightClient();
// Set browser type
client.setBrowserName("chromium"); // chromium, firefox, webkit
// Configure launch options
LaunchOptions launchOptions = new LaunchOptions()
.setHeadless(true)
.setTimeout(30000);
client.setLaunchOptions(launchOptions);
// Set rendering state
client.setRenderedState(LoadState.NETWORKIDLE);
// Configure timeouts
client.setDownloadTimeout(15000); // 15 seconds
client.setCloseTimeout(15000); // 15 seconds
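Both timeout values are plain milliseconds. As a JDK-only illustration of what such a bound means in practice (this sketch does not use the Playwright client; the class and method names are hypothetical):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {
    // Simulates waiting for a download with an upper bound, the way
    // downloadTimeout bounds how long the client waits for a finished download.
    static String fetchWithTimeout(long workMillis, long timeoutMillis) {
        CompletableFuture<String> download = CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(workMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "downloaded";
        });
        try {
            return download.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "timed out";
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchWithTimeout(10, 1000));  // fast download completes
        System.out.println(fetchWithTimeout(1000, 50));  // slow download hits the bound
    }
}
```

A download that exceeds the bound is abandoned rather than blocking the crawl indefinitely, which is why larger files may need a higher `downloadTimeout` (see the file download example below).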
// The client automatically integrates with Fess's HcHttpClient for authentication
// Configure authentication in your Fess crawler configuration
NewContextOptions contextOptions = new NewContextOptions()
.setUserAgent("CustomUserAgent/1.0")
.setExtraHTTPHeaders(Map.of("Authorization", "Bearer token"));
client.setNewContextOptions(contextOptions);
// Set proxy through system properties
System.setProperty("http.proxyHost", "proxy.example.com");
System.setProperty("http.proxyPort", "8080");
System.setProperty("fess.crawler.playwright.proxy.bypass", "*.local,127.0.0.1");
// Or configure directly
client.addOption("proxyHost", "proxy.example.com");
client.addOption("proxyPort", "8080");
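The bypass value is a comma-separated list of host patterns. A JDK-only sketch of setting and reading these properties before client initialization (property names are taken from the section above; the parsing helper is illustrative, not part of the client API):

```java
import java.util.Arrays;
import java.util.List;

public class ProxyPropsSketch {
    // Splits the comma-separated bypass list into individual host patterns.
    static List<String> parseBypass(String value) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .toList();
    }

    public static void main(String[] args) {
        // Same properties the client reads, set programmatically.
        System.setProperty("http.proxyHost", "proxy.example.com");
        System.setProperty("http.proxyPort", "8080");
        System.setProperty("fess.crawler.playwright.proxy.bypass", "*.local,127.0.0.1");

        List<String> bypass =
                parseBypass(System.getProperty("fess.crawler.playwright.proxy.bypass"));
        System.out.println(bypass); // [*.local, 127.0.0.1]
    }
}
```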
// Ignore SSL certificate errors for testing
client.addOption("ignoreHttpsErrors", "true");
PlaywrightClient client = new PlaywrightClient();
client.setBrowserName("chromium");
client.setRenderedState(LoadState.NETWORKIDLE); // Wait for network to be idle
client.init();
RequestData requestData = RequestDataBuilder.newRequestData()
.get()
.url("https://spa-example.com")
.build();
ResponseData responseData = client.execute(requestData);
String content = new String(responseData.getResponseBody(), responseData.getCharSet());
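The detected charset may occasionally be null or invalid for unusual pages; a defensive decode using only the JDK can guard against that (the helper name is illustrative, not part of the crawler API):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    // Decodes response bytes, falling back to UTF-8 when the reported
    // charset is missing or unsupported.
    static String decode(byte[] body, String charsetName) {
        Charset cs = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                cs = Charset.forName(charsetName);
            } catch (IllegalArgumentException e) {
                // Unknown or illegal charset name: keep the UTF-8 fallback.
            }
        }
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] body = "héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, "UTF-8"));          // héllo
        System.out.println(decode(body, null));             // héllo (fallback)
        System.out.println(decode(body, "no-such-charset")); // héllo (fallback)
    }
}
```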
PlaywrightClient client = new PlaywrightClient();
client.setBrowserName("chromium");
client.setDownloadTimeout(30000); // 30 seconds for large files
client.init();
// The client automatically handles downloads for PDF, images, documents, etc.
RequestData requestData = RequestDataBuilder.newRequestData()
.get()
.url("https://example.com/document.pdf")
.build();
ResponseData responseData = client.execute(requestData);
// File content is available in responseData.getResponseBody()
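Persisting the downloaded bytes is then plain JDK I/O; a self-contained sketch where the byte array stands in for `responseData.getResponseBody()`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveDownloadSketch {
    // Writes a downloaded body to disk and returns the number of bytes written.
    static long save(byte[] body, Path target) throws IOException {
        Files.write(target, body);
        return Files.size(target);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for responseData.getResponseBody().
        byte[] body = "%PDF-1.7 ...".getBytes();
        Path target = Files.createTempFile("document", ".pdf");
        System.out.println(save(body, target));
        Files.delete(target);
    }
}
```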
src/
├── main/java/org/codelibs/fess/crawler/client/http/
│ ├── PlaywrightClient.java # Main crawler client implementation
│ └── PlaywrightClientCreator.java # Factory for creating client instances
├── main/resources/crawler/
│ └── client++.xml # Spring configuration
└── test/
├── java/org/codelibs/fess/crawler/client/http/
│ ├── PlaywrightClientTest.java # Basic functionality tests
│ ├── PlaywrightAuthTest.java # Authentication tests
│ ├── PlaywrightClientProxyTest.java # Proxy configuration tests
│ └── PlaywrightClientSslIgnoreTest.java # SSL tests
└── resources/
├── docroot/ # Test web content
└── sslKeystore/ # SSL certificates for testing
# Full build with tests
mvn clean package
# Run specific test
mvn -Dtest=PlaywrightClientTest test
# Skip tests
mvn clean package -DskipTests
# Format code
mvn net.revelc.code.formatter:formatter-maven-plugin:format
The test suite includes comprehensive tests for:
- Basic functionality: HTML, PDF, image crawling
- Authentication: Integration with Fess auth system
- Proxy configuration: Proxy server and bypass patterns
- SSL handling: Certificate validation and ignoring
- File downloads: Various content types
- Error handling: Network failures and timeouts
Tests use a local Jetty server (CrawlerWebServer) with test content in src/test/resources/docroot/.
| Property | Default | Description |
|---|---|---|
| browserName | chromium | Browser type: chromium, firefox, webkit |
| sharedClient | false | Enable shared Playwright worker |
| downloadTimeout | 15000 | Download timeout in milliseconds |
| closeTimeout | 15000 | Resource cleanup timeout in milliseconds |
| renderedState | LOAD | When to extract content: LOAD, DOMCONTENTLOADED, NETWORKIDLE |
| contentWaitDuration | 0 | Additional wait time before content extraction |
| ignoreHttpsErrors | false | Skip SSL certificate validation |
Browser installation fails
# Ensure Node.js is installed and run:
npx playwright install --with-deps
Tests fail with "Browser not found"
- Verify Playwright browsers are installed
- Check if running in headless environment (CI/CD)
SSL certificate errors
// For testing only - ignore SSL errors
client.addOption("ignoreHttpsErrors", "true");
Memory issues with large sites
// Reduce resource usage
LaunchOptions options = new LaunchOptions()
.setArgs(Arrays.asList("--no-sandbox", "--disable-dev-shm-usage"));
client.setLaunchOptions(options);
Proxy authentication
// Configure proxy with authentication
NewContextOptions contextOptions = new NewContextOptions()
.setProxy(new Proxy("proxy.example.com:8080")
.setUsername("user")
.setPassword("pass"));
client.setNewContextOptions(contextOptions);
- Fork the repository
- Create a feature branch:
git checkout -b feature/my-feature
- Make changes and add tests
- Format code:
mvn formatter:format
- Run tests:
mvn test
- Commit changes:
git commit -am 'Add new feature'
- Push branch:
git push origin feature/my-feature
- Submit a Pull Request
- Follow existing code formatting (enforced by formatter-maven-plugin)
- Add comprehensive tests for new features
- Update documentation for public APIs
- Ensure all tests pass before submitting PR
Licensed under the Apache License, Version 2.0. See LICENSE for details.