- 1. Introduction
- 2. Potential Use Cases
- 3. Detailed JSON Structure
- 4. JSON Schema (Draft)
- 5. Analysis Tables Description (`tables` key)
  - 5.1. `skipped-summary` (Skipped URLs Summary)
  - 5.2. `skipped` (Skipped URLs)
  - 5.3. `redirects` (Redirected URLs)
  - 5.4. `404` (404 URLs)
  - 5.5. `certificate-info` (SSL/TLS info)
  - 5.6. `fastest-urls` (TOP fastest URLs)
  - 5.7. `slowest-urls` (TOP slowest URLs)
  - 5.8. `seo` (SEO metadata)
  - 5.9. `open-graph` (OpenGraph metadata)
  - 5.10. `seo-headings` (Heading structure)
  - 5.11. `headers` (HTTP headers)
  - 5.12. `headers-values` (HTTP header values)
  - 5.13. `caching-per-content-type` (HTTP Caching by content type)
  - 5.14. `caching-per-domain` (HTTP Caching by domain)
  - 5.15. `caching-per-domain-and-content-type` (HTTP Caching by domain and content type)
  - 5.16. `non-unique-titles` (TOP non-unique titles)
  - 5.17. `non-unique-descriptions` (TOP non-unique descriptions)
  - 5.18. `best-practices` (Best practices)
  - 5.19. `accessibility` (Accessibility)
  - 5.20. `source-domains` (Source domains)
  - 5.21. `content-types` (Content types)
  - 5.22. `content-types-raw` (Content types (MIME types))
  - 5.23. `dns` (DNS info)
  - 5.24. `security` (Security)
  - 5.25. `analysis-stats` (Analysis stats)
- 6. Note on Text Output
This document describes the structure and content of the JSON output file generated by the SiteOne Crawler. This JSON file contains detailed information about the crawled website, including metadata about the crawl process, results for each visited URL, and various analysis tables.
The JSON output provides a comprehensive dataset about the crawled website. Key information includes:
- Crawl Metadata: Details about the crawler execution, such as version, execution time, command used, hostname, and the final user agent.
- Visited URL Results: For each URL visited during the crawl:
  - URL address
  - HTTP status code
  - Elapsed time for the request (performance)
  - Size of the response body
  - Content type (HTML, CSS, JS, Image, etc.)
  - Caching information (cache flags, lifetime)
  - Additional analysis results stored in the `extras` field (empty in the provided example, but it often contains specific analyzer outputs)
- Analysis Tables: Aggregated data and specific findings presented in structured tables:
  - Skipped URLs: Reasons why certain URLs were not crawled (e.g., external domain, disallowed by robots.txt, specific rules).
  - Redirects: List of URLs that resulted in redirects (3xx status codes).
  - 404 Errors: List of URLs that resulted in a 404 Not Found status.
  - SSL/TLS Info: Details about the website's SSL certificate (issuer, subject, validity dates, supported protocols).
  - Performance: Tables listing the fastest and slowest URLs encountered during the crawl.
  - SEO & Content:
    - SEO metadata (title, description, keywords, H1, indexing directives) for HTML pages.
    - OpenGraph metadata (og:title, og:description, og:image, etc.).
    - Heading structure analysis (correctness of H1-H6 hierarchy).
    - Analysis of non-unique titles and descriptions across pages.
  - Technical Details:
    - HTTP Headers: Summary of headers found, their occurrences, and unique values.
    - Caching Analysis: Breakdown of caching strategies by content type and domain.
    - DNS Information: DNS resolution details for the target domain.
    - Security Analysis: Evaluation of security-related HTTP headers.
  - Crawler Statistics: Performance metrics for the crawler itself and individual analyzers.
The detailed data within the JSON output enables a wide variety of use cases:
- Comprehensive SEO Audits: Analyze titles, descriptions, heading structures, indexing status, and OpenGraph tags across the entire site.
- Performance Monitoring & Optimization: Identify the slowest pages and resources, analyze load times, and check caching headers.
- Broken Link Checking: Easily extract lists of all 404 errors and the pages where they were found.
- Redirect Chain Analysis: Identify and analyze redirect chains (although the example shows no redirects, the structure supports it).
- Security Header Audits: Verify the implementation of crucial security headers (CSP, HSTS, X-Frame-Options, etc.) across the site.
- Content Inventory & Analysis: Get a list of all crawled resources, their types, sizes, and status codes. Analyze content type distribution.
- Website Archiving/Cloning: While the crawler has a dedicated offline export, the JSON contains the list of all discovered resources, which could inform a custom archiving process.
- Competitive Analysis: Run the crawler on competitor sites (respecting their `robots.txt`) to gather insights into their structure, performance, and technology.
- CI/CD Integration: Integrate the crawler into deployment pipelines to automatically check for new errors (404s, performance regressions) after deployments.
- Technical Debt Assessment: Identify outdated practices, missing security headers, or performance issues that need addressing.
The JSON output has the following main top-level keys:
`extraColumnsFromAnalysis` is an array of objects defining extra columns that may be added during specific analyses. These appear to be intended mainly for augmenting report outputs. Each object contains:
- `name` (String): The display name of the column.
- `length` (Integer): Suggested display length/width.
- `truncate` (Boolean): Whether the content should be truncated if it exceeds the length.
- `customMethod`, `customPattern`, `customGroup`: Fields likely used for custom data extraction logic (`null` in the example).
`crawler` contains metadata about the crawler execution:
- `name` (String): Name of the crawler software.
- `version` (String): Version of the crawler.
- `executedAt` (String): Timestamp when the crawl was executed.
- `command` (String): The command-line arguments used to run the crawl.
- `hostname` (String): The hostname where the crawler was run.
- `finalUserAgent` (String): The User-Agent string used for the HTTP requests.
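As a minimal sketch, the report can be loaded with Python's standard `json` module and the `crawler` metadata turned into a one-line summary. The helper names `load_report` and `describe_crawl` are illustrative assumptions and not part of the crawler; the required keys are taken from the draft schema in this document.

```python
import json
from pathlib import Path

# Top-level keys marked as required in the draft schema in this document.
REQUIRED_KEYS = ("crawler", "results", "tables")


def load_report(path):
    """Load a JSON report and check that the expected top-level keys exist."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in report]
    if missing:
        raise ValueError(f"report is missing top-level keys: {missing}")
    return report


def describe_crawl(report):
    """Build a one-line summary from the `crawler` metadata object."""
    crawler = report["crawler"]
    return (
        f"{crawler['name']} {crawler['version']} "
        f"ran at {crawler['executedAt']} on {crawler['hostname']}"
    )
```

The timestamp is kept as a plain string here because its exact format is not specified in this document.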
`results` is an array of objects, where each object represents a single visited URL.
- `url` (String): The absolute URL that was visited.
- `status` (String): The HTTP status code returned (e.g., "200", "404").
- `elapsedTime` (Float): Time taken to fetch the URL in seconds.
- `size` (Integer): Size of the response body in bytes.
- `type` (Integer): An enum representing the detected content type:
  - `1`: HTML
  - `2`: JavaScript
  - `3`: CSS
  - `4`: Image
  - `7`: Document (e.g., robots.txt)
  - `8`: JSON
  - Other types might exist.
- `cacheTypeFlags` (Integer): Bitmask representing detected caching mechanisms (e.g., Cache-Control, ETag, Last-Modified). Its interpretation depends on the crawler's internal logic.
  - `31` likely means Cache-Control + ETag + Last-Modified.
  - `19` might indicate Cache-Control + Last-Modified.
  - `32768` might indicate no caching headers found.
- `cacheLifetime` (Integer | null): Cache lifetime in seconds derived from the `Cache-Control: max-age` or `Expires` header. `null` if no lifetime could be determined.
- `extras` (Array | Object): Contains additional data from specific analyzers run on this URL. In the provided example it is an empty array `[]`, but for HTML pages it often contains an object with `Title`, `Description`, etc.
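The following sketch shows one way to work with the `results` fields described above: decoding the `type` enum, splitting `cacheTypeFlags` into its individual bits, and filtering on the string-valued `status`. The function names are hypothetical, the type mapping contains only the codes listed above, and the meaning of each individual bit is not documented here, so the bits are returned as plain numbers rather than names.

```python
from collections import Counter

# Content type codes listed above. Other codes may exist, so unknown values
# are reported as "Other (<code>)" rather than guessed.
CONTENT_TYPES = {1: "HTML", 2: "JavaScript", 3: "CSS", 4: "Image", 7: "Document", 8: "JSON"}


def count_by_type(results):
    """Count visited URLs per decoded content type."""
    return Counter(
        CONTENT_TYPES.get(item["type"], f"Other ({item['type']})") for item in results
    )


def set_bits(flags):
    """Split a bitmask such as `cacheTypeFlags` into its individual bit values."""
    return [1 << i for i in range(flags.bit_length()) if flags & (1 << i)]


def error_urls(results):
    """Return URLs whose status starts with 4 or 5; `status` is a string."""
    return [item["url"] for item in results if item["status"][:1] in ("4", "5")]
```

For example, `set_bits(19)` returns `[1, 2, 16]`; which caching mechanism each bit stands for is defined by the crawler.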
`tables` is an object where each key is a table identifier (e.g., `skipped-summary`, `404`, `seo`) and the value is an object describing that table. Each table object contains:
- `aplCode` (String): A unique code for the table.
- `title` (String): A human-readable title for the table.
- `columns` (Object): An object describing the columns of the table. Each key is a column identifier (e.g., `reason`, `url`, `statusCode`). The value is an object detailing the column:
  - `aplCode` (String): Unique code for the column.
  - `name` (String): Display name for the column header.
  - `width` (Integer): Suggested display width (`-1` might mean auto).
  - `formatter` (Object | null): Defines how the data should be formatted (e.g., adding units like 'ms' or 'kB'). An empty object `{}` might indicate default or specific internal formatting.
  - `renderer` (Object | null): Defines how the data should be rendered (e.g., adding color or links). An empty object `{}` might indicate default or specific internal rendering.
  - `truncateIfLonger` (Boolean): Whether to truncate the value if it exceeds the width.
  - Other fields like `formatterWillChangeValueLength`, `nonBreakingSpaces`, `escapeOutputHtml`, `getDataValueCallback`, `forcedDataType` provide more hints for rendering.
- `rows` (Array): An array of objects, where each object represents a row in the table. The keys in each row object correspond to the column identifiers defined in `columns`.
- `position` (String): A hint about where this table should typically be positioned in a report (e.g., `before-url-table`, `after-url-table`).
Note: The specific content and structure within tables depend heavily on the analyzers enabled during the crawl. The example shows many tables related to performance, SEO, caching, headers, security, and skipped URLs.
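Because the set of tables depends on the enabled analyzers, consumer code is safer when it treats every table as optional. The sketch below, with hypothetical helper names, reads a table's rows defensively and flattens a table into a header row plus data rows using the `columns` and `rows` structure described above.

```python
def get_rows(report, table_key):
    """Return the rows of a table, or an empty list if the table is absent."""
    return report.get("tables", {}).get(table_key, {}).get("rows", [])


def table_to_matrix(table):
    """Return a header row followed by data rows, in `columns` order.

    Row objects may carry extra keys that are not declared in `columns`;
    those are ignored here.
    """
    column_ids = list(table["columns"])
    header = [table["columns"][col]["name"] for col in column_ids]
    body = [[row.get(col) for col in column_ids] for row in table["rows"]]
    return [header] + body
```

The result of `table_to_matrix` can be passed directly to `csv.writer(...).writerows(...)` for a quick CSV export.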
This is a draft JSON schema based on the provided example. It might need refinement based on other possible outputs.
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "SiteOne Crawler JSON Output",
  "description": "Schema for the JSON output file generated by SiteOne Crawler.",
  "type": "object",
  "properties": {
    "extraColumnsFromAnalysis": {
      "description": "Definitions for extra columns used in analyses.",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "length": { "type": "integer" },
          "truncate": { "type": "boolean" },
          "customMethod": { "type": ["string", "null"] },
          "customPattern": { "type": ["string", "null"] },
          "customGroup": { "type": ["string", "null"] }
        },
        "required": ["name", "length", "truncate"]
      }
    },
    "crawler": {
      "description": "Metadata about the crawler execution.",
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "version": { "type": "string" },
        "executedAt": { "type": "string", "format": "date-time" },
        "command": { "type": "string" },
        "hostname": { "type": "string" },
        "finalUserAgent": { "type": "string" }
      },
      "required": ["name", "version", "executedAt", "command", "hostname", "finalUserAgent"]
    },
    "results": {
      "description": "Array of results for each visited URL.",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "url": { "type": "string", "format": "uri" },
          "status": { "type": "string" },
          "elapsedTime": { "type": "number" },
          "size": { "type": "integer" },
          "type": { "type": "integer", "description": "Enum for content type (1:HTML, 2:JS, 3:CSS, 4:Image, ...)" },
          "cacheTypeFlags": { "type": "integer", "description": "Bitmask for caching mechanisms" },
          "cacheLifetime": { "type": ["integer", "null"], "description": "Cache lifetime in seconds or null" },
          "extras": {
            "type": ["array", "object"],
            "description": "Additional analysis data for this URL"
          }
        },
        "required": ["url", "status", "elapsedTime", "size", "type", "cacheTypeFlags", "cacheLifetime", "extras"]
      }
    },
    "tables": {
      "description": "Aggregated analysis results presented as tables.",
      "type": "object",
      "additionalProperties": {
        "type": "object",
        "properties": {
          "aplCode": { "type": "string" },
          "title": { "type": "string" },
          "columns": {
            "type": "object",
            "additionalProperties": {
              "type": "object",
              "properties": {
                "aplCode": { "type": "string" },
                "name": { "type": "string" },
                "width": { "type": "integer" },
                "formatter": { "type": ["object", "null"] },
                "renderer": { "type": ["object", "null"] },
                "truncateIfLonger": { "type": "boolean" }
              },
              "required": ["aplCode", "name", "width"]
            }
          },
          "rows": {
            "type": "array",
            "items": {
              "type": "object"
            }
          },
          "position": { "type": "string", "enum": ["before-url-table", "after-url-table"] }
        },
        "required": ["aplCode", "title", "columns", "rows", "position"]
      }
    }
  },
  "required": ["crawler", "results", "tables"]
}
```
This section details the structure and columns of each table found under the `tables` key in the JSON output, based on the provided example.
The `skipped-summary` table provides a summary of skipped URLs grouped by domain and reason.
- `reason`: (Integer) An enum indicating why URLs from this domain were skipped (e.g., `1` for external domain, `2` for disallowed by robots.txt; the specific codes depend on the crawler implementation).
- `domain`: (String) The domain name whose URLs were skipped.
- `count`: (Integer) The number of unique URLs skipped for this domain and reason.
The `skipped` table lists individual URLs that were skipped during the crawl.
- `reason`: (Integer) The reason code why the URL was skipped.
- `url`: (String) The URL that was skipped.
- `sourceAttr`: (Integer) An enum indicating the source type where the skipped URL was found (e.g., `10` might represent an `<a>` tag's `href` attribute).
- `sourceUqId`: (String) A unique identifier (likely a hash) corresponding to the `uqId` of the page where the skipped URL was found. This allows linking back to the source page. Note that `uqId` is not present in the base `results` array in the example; it might be added by specific configurations or be available only internally.
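A small sketch of how skipped URLs could be grouped by their `reason` code for review. The function name is hypothetical, and the reason codes are kept as raw integers because their meaning depends on the crawler implementation.

```python
from collections import defaultdict


def skipped_by_reason(rows):
    """Group skipped URLs by their integer `reason` code."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["reason"]].append(row["url"])
    return dict(grouped)
```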
The `redirects` table lists URLs that resulted in an HTTP redirect (3xx status code). (Note: This table is empty in the provided example.)
- `statusCode`: (Integer) The specific redirect status code (e.g., 301, 302).
- `url`: (String) The original URL that redirected.
- `targetUrl`: (String) The target URL to which the original URL redirected.
- `sourceUqId`: (String) Identifier of the page where the redirected URL was found.
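Redirect chains can be reconstructed by following `url` to `targetUrl` links across rows. The sketch below is one possible approach, not the crawler's own logic; it assumes `targetUrl` is stored in the same absolute form as `url`, and the function name is hypothetical. Chains are reported from URLs that are not themselves redirect targets, so a closed loop with no outside entry point is not reported.

```python
def redirect_chains(rows):
    """Return redirect chains that involve two or more hops."""
    targets = {row["url"]: row["targetUrl"] for row in rows}
    all_targets = set(targets.values())
    chains = []
    for start in targets:
        # Start only from URLs that no other redirect points to,
        # so each chain is reported once from its first URL.
        if start in all_targets:
            continue
        chain, seen = [start], {start}
        current = start
        while current in targets:
            current = targets[current]
            chain.append(current)
            if current in seen:  # redirect loop: stop instead of cycling
                break
            seen.add(current)
        if len(chain) > 2:  # more than one hop
            chains.append(chain)
    return chains
```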
The `404` table lists URLs that resulted in a "404 Not Found" status code.
- `statusCode`: (Integer) The HTTP status code (typically 404).
- `url`: (String) The URL that resulted in the 404 error.
- `sourceUqId`: (String) Identifier of the page where the broken URL was found.
- (Additional fields like `uqId`, `sourceAttr`, `requestTime`, `size`, `contentType`, etc., are also present in the row data, providing full context similar to the main `results` array.)
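For a CI/CD check, the `404` table can be reduced to a pass/fail signal. This is a sketch with hypothetical function names; it treats a missing `404` table as "no broken links", which is an assumption about how a pipeline would want to behave.

```python
def broken_links(report):
    """Return (url, sourceUqId) pairs from the `404` table, if it is present."""
    rows = report.get("tables", {}).get("404", {}).get("rows", [])
    return [(row["url"], row.get("sourceUqId")) for row in rows]


def ci_exit_code(report):
    """Return 1 when any broken link was reported, otherwise 0."""
    return 1 if broken_links(report) else 0
```

A wrapper script could pass the result to `sys.exit()` so that the pipeline step fails when broken links are found.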
The `certificate-info` table provides details about the SSL/TLS certificate of the crawled domain.
- `info`: (String) The name of the certificate attribute (e.g., "Issuer", "Subject", "Valid from", "Valid to", "Supported protocols", "RAW certificate output", "RAW protocols output").
- `value`: (String | Array) The value of the corresponding certificate attribute. It can be a string or an array (e.g., for supported protocols), and includes the detailed raw outputs of the certificate and protocol checks.
The `fastest-urls` table lists the URLs with the lowest request times encountered during the crawl.
- `requestTime`: (Float) The time taken to fetch the URL in seconds.
- `statusCode`: (Integer) The HTTP status code of the URL.
- `url`: (String) The URL itself.
- (Additional fields like `uqId`, `sourceUqId`, `size`, `contentType`, `extras`, etc., are also present, providing full context.)
The `slowest-urls` table lists the URLs with the highest request times encountered during the crawl.
- `requestTime`: (Float) The time taken to fetch the URL in seconds.
- `statusCode`: (Integer) The HTTP status code of the URL.
- `url`: (String) The URL itself.
- (Additional fields like `uqId`, `sourceUqId`, `size`, `contentType`, `extras`, etc., are also present, providing full context.)
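To turn the `slowest-urls` rows into a performance budget check, the rows can be filtered against a threshold. The function name and the threshold value used in any caller are illustrative assumptions.

```python
def slower_than(rows, threshold_seconds):
    """Return (url, requestTime) pairs above the threshold, slowest first."""
    slow = [
        (row["url"], row["requestTime"])
        for row in rows
        if row["requestTime"] > threshold_seconds
    ]
    return sorted(slow, key=lambda pair: pair[1], reverse=True)
```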
The `seo` table provides SEO-related metadata extracted from HTML pages.
- `urlPathAndQuery`: (String) The path and query string of the URL.
- `indexing`: (Object/String) Information about whether the page is indexable/followable based on robots meta tags and X-Robots-Tag headers. Contains `robotsIndex` (1=index, 0=noindex), `robotsFollow` (1=follow, 0=nofollow), `deniedByRobotsTxt` (boolean).
- `title`: (String | null) The content of the `<title>` tag.
- `h1`: (String | null) The content of the first `<h1>` tag found.
- `description`: (String | null) The content of the `meta name="description"` tag.
- `keywords`: (String | null) The content of the `meta name="keywords"` tag.
- (This table also includes OpenGraph, Twitter Card, and Heading Tree data within each row object.)
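A simple SEO audit over the `seo` rows might list pages with missing metadata. The sketch below checks only the nullable string fields documented above; it deliberately leaves `indexing` alone because that field's exact shape is uncertain. The function name is hypothetical.

```python
def missing_seo_fields(rows, fields=("title", "description", "h1")):
    """Map each URL path to the SEO fields that are null or empty."""
    findings = {}
    for row in rows:
        missing = [name for name in fields if not row.get(name)]
        if missing:
            findings[row["urlPathAndQuery"]] = missing
    return findings
```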
The `open-graph` table provides Open Graph and Twitter Card metadata extracted from HTML pages.
- `urlPathAndQuery`: (String) The path and query string of the URL.
- `ogTitle`: (String | null) Content of the `og:title` meta tag.
- `ogDescription`: (String | null) Content of the `og:description` meta tag.
- `ogImage`: (String | null) Content of the `og:image` meta tag.
- `twitterTitle`: (String | null) Content of the `twitter:title` meta tag.
- `twitterDescription`: (String | null) Content of the `twitter:description` meta tag.
- `twitterImage`: (String | null) Content of the `twitter:image` meta tag.
- (Other OG/Twitter fields like `ogType`, `ogUrl`, `ogSiteName`, `twitterCard`, `twitterSite`, `twitterCreator` are also included in the row data.)
The `seo-headings` table provides analysis of the heading (H1-H6) structure for each HTML page.
- `headings`: (Object/String) A representation of the heading structure, often a tree or formatted string showing hierarchy and potential errors.
- `headingsCount`: (Integer) Total number of headings found on the page.
- `headingsErrorsCount`: (Integer) Number of structural errors found in the headings (e.g., skipping levels, incorrect starting level).
- `urlPathAndQuery`: (String) The path and query string of the URL.
- (The raw `headingTreeItems` array is also included in each row object.)
The `headers` table summarizes the HTTP response headers encountered across all crawled URLs.
- `header`: (String) The name of the HTTP header (case-insensitive).
- `occurrences`: (Integer) The total number of times this header was found.
- `uniqueValues`: (Object | Array) An object or array showing the distinct values found for this header and their counts. If the number of unique values is large, this might be truncated or just show counts.
- `valuesPreview`: (String) A preview string showing some of the values encountered.
- `minValue`: (Integer | String | null) The minimum value found (relevant for numerical or date headers like `Content-Length` or `Last-Modified`).
- `maxValue`: (Integer | String | null) The maximum value found.
The `headers-values` table lists unique values for each HTTP header and their occurrence count.
- `header`: (String) The name of the HTTP header.
- `occurrences`: (Integer) The number of times this specific value occurred for this header.
- `value`: (String) The specific unique value of the HTTP header.
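The flat `headers-values` rows can be regrouped into a per-header lookup. In this sketch, header names are lowercased because HTTP header names are case-insensitive, and counts for the same value are summed in case a header appears under more than one spelling. The function name is hypothetical.

```python
def values_by_header(rows):
    """Return {header_name_lowercase: {value: occurrences}}."""
    grouped = {}
    for row in rows:
        bucket = grouped.setdefault(row["header"].lower(), {})
        bucket[row["value"]] = bucket.get(row["value"], 0) + row["occurrences"]
    return grouped
```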
The `caching-per-content-type` table analyzes caching effectiveness grouped by general content type (HTML, Image, JS, CSS, etc.).
- `contentType`: (String) The general content type category.
- `cacheType`: (String) Description of the caching mechanism detected (e.g., "Cache-Control + ETag + Last-Modified", "No cache headers").
- `count`: (Integer) Number of URLs matching this content type and cache type.
- `avgLifetime`: (Float | null) Average cache lifetime in seconds for URLs in this group (if determinable).
- `minLifetime`: (Integer | null) Minimum cache lifetime in seconds.
- `maxLifetime`: (Integer | null) Maximum cache lifetime in seconds.

The `caching-per-domain` table analyzes caching effectiveness grouped by domain.
- `domain`: (String) The domain name.
- `cacheType`: (String) Description of the caching mechanism detected.
- `count`: (Integer) Number of URLs from this domain matching this cache type.
- `avgLifetime`: (Float | null) Average cache lifetime in seconds.
- `minLifetime`: (Integer | null) Minimum cache lifetime in seconds.
- `maxLifetime`: (Integer | null) Maximum cache lifetime in seconds.
The `caching-per-domain-and-content-type` table analyzes caching effectiveness grouped by both domain and general content type.
- `domain`: (String) The domain name.
- `contentType`: (String) The general content type category.
- `cacheType`: (String) Description of the caching mechanism detected.
- `count`: (Integer) Number of URLs matching this domain, content type, and cache type.
- `avgLifetime`: (Float | null) Average cache lifetime in seconds.
- `minLifetime`: (Integer | null) Minimum cache lifetime in seconds.
- `maxLifetime`: (Integer | null) Maximum cache lifetime in seconds.
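The three caching tables share the `cacheType`, `count`, and lifetime fields, so one helper can flag weakly cached groups in any of them. The sketch below is illustrative: the function name and the one-hour default threshold are assumptions, not recommendations from the crawler.

```python
def weak_cache_groups(rows, min_seconds=3600):
    """Return rows whose minimum lifetime is unknown or below `min_seconds`."""
    return [
        row
        for row in rows
        if row.get("minLifetime") is None or row["minLifetime"] < min_seconds
    ]
```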
The `non-unique-titles` table lists page titles that appear on more than one page.
- `count`: (Integer) The number of pages sharing this title.
- `title`: (String) The non-unique page title.

The `non-unique-descriptions` table lists meta descriptions that appear on more than one page.
- `count`: (Integer) The number of pages sharing this description.
- `description`: (String) The non-unique meta description content.
The `best-practices` table summarizes the results of various best practice checks performed by analyzers.
- `analysisName`: (String) The name of the specific best practice check (e.g., "Large inline SVGs", "Heading structure", "Brotli support").
- `ok`: (Integer) Count of URLs passing this check.
- `notice`: (Integer) Count of URLs with a notice-level finding for this check.
- `warning`: (Integer) Count of URLs with a warning-level finding.
- `critical`: (Integer) Count of URLs with a critical-level finding.

The `accessibility` table summarizes the results of accessibility checks.
- `analysisName`: (String) The name of the specific accessibility check (e.g., "Missing image alt attributes", "Missing html lang attribute").
- `ok`: (Integer) Count of elements/pages passing this check.
- `notice`: (Integer) Count of notice-level findings.
- `warning`: (Integer) Count of warning-level findings.
- `critical`: (Integer) Count of critical-level findings.
The `source-domains` table provides statistics about the domains from which resources were loaded.
- `domain`: (String) The domain name.
- `totals`: (String) A summary string showing total count, size, and time for resources from this domain (e.g., "67/30MB/6.2s").
- `HTML`, `Image`, `JS`, `CSS`, `Document`, `JSON`, etc.: (String) Summary strings (count/size/time) broken down by content type for resources loaded from this domain.
- `totalCount`: (Integer) Total number of resources loaded from this domain.
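The summary strings in this table are formatted for display, so they need to be split before further processing. The sketch below assumes every summary has exactly three slash-separated parts, as in the example "67/30MB/6.2s"; the size and time parts are kept as text because their unit formats are not specified here. The function name is hypothetical.

```python
def parse_summary(summary):
    """Split a "count/size/time" string into (int count, size text, time text)."""
    count, size, time_spent = (part.strip() for part in summary.split("/"))
    return int(count), size, time_spent
```

A string with a different number of parts raises `ValueError`, which makes a format change easy to notice.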
The `content-types` table summarizes statistics grouped by general content type.
- `contentType`: (String) The general content type category (e.g., "HTML", "Image").
- `count`: (Integer) Total number of URLs of this content type.
- `totalSize`: (Integer) Total size in bytes for this content type.
- `totalTime`: (Float) Total time spent fetching resources of this content type.
- `avgTime`: (Float) Average time spent fetching a resource of this content type.
- `status20x`: (Integer) Count of URLs with a 2xx status code.
- `status40x`: (Integer) Count of URLs with a 4xx status code.
- (Counts for other status code ranges like 3xx, 5xx might also appear depending on results.)
The `content-types-raw` table summarizes statistics grouped by the specific MIME type reported in the `Content-Type` HTTP header.
- `contentType`: (String) The raw MIME type string (e.g., "text/html", "image/svg+xml", "text/html; charset=utf-8").
- `count`: (Integer) Total number of URLs with this MIME type.
- `totalSize`: (Integer) Total size in bytes.
- `totalTime`: (Float) Total time spent fetching.
- `avgTime`: (Float) Average time spent fetching.
- `status20x`: (Integer) Count of URLs with a 2xx status code.
- `status40x`: (Integer) Count of URLs with a 4xx status code.
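Both content type tables expose `contentType` and `totalSize`, so the share of transferred bytes per type can be derived from either. This is a small sketch with a hypothetical function name.

```python
def size_share(rows):
    """Return each content type's share of the total bytes, as a 0..1 fraction."""
    total = sum(row["totalSize"] for row in rows)
    if total == 0:
        return {}
    return {row["contentType"]: row["totalSize"] / total for row in rows}
```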
The `dns` table shows the DNS resolution information for the crawled domain(s).
- `info`: (String) A line of text representing part of the DNS resolution (e.g., the domain name, an IP address, the DNS server used). Presented as a simple text tree.
The `security` table summarizes findings related to security HTTP headers.
- `header`: (String) The name of the security header being analyzed (e.g., "Strict-Transport-Security", "X-Frame-Options").
- `ok`: (Integer) Count of URLs where the header was configured correctly according to the analyzer's rules.
- `notice`: (Integer) Count of URLs with a notice-level finding related to this header.
- `warning`: (Integer) Count of URLs with a warning-level finding.
- `critical`: (Integer) Count of URLs with a critical-level finding.
- `recommendation`: (Object | Array) Contains textual recommendations based on the findings for this header.
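The `security`, `best-practices`, and `accessibility` rows all carry `ok`, `notice`, `warning`, and `critical` counters, so a single helper can total them for a quality gate. The function names below are hypothetical, and treating missing or null counters as zero is an assumption.

```python
SEVERITIES = ("ok", "notice", "warning", "critical")


def severity_totals(rows):
    """Sum the severity counters across all rows of a findings table."""
    totals = dict.fromkeys(SEVERITIES, 0)
    for row in rows:
        for level in SEVERITIES:
            totals[level] += row.get(level) or 0
    return totals


def has_critical(rows):
    """Return True when any row reports at least one critical finding."""
    return severity_totals(rows)["critical"] > 0
```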
The `analysis-stats` table provides performance metrics for individual analyzer methods.
- `classAndMethod`: (String) The class and method name of the analyzer function.
- `execTime`: (Float) Total execution time in seconds spent in this method across all relevant URLs/data points.
- `execCount`: (Integer) The number of times this method was executed.
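Since `execTime` is a total, dividing it by `execCount` gives the average cost per call. The sketch below ranks analyzer methods by total time and adds that average; the function name and the `limit` parameter are illustrative.

```python
def slowest_analyzers(rows, limit=5):
    """Return (classAndMethod, total seconds, seconds per call), slowest first."""
    ranked = sorted(rows, key=lambda row: row["execTime"], reverse=True)
    return [
        (
            row["classAndMethod"],
            row["execTime"],
            # Guard against a zero count to avoid division by zero.
            row["execTime"] / row["execCount"] if row["execCount"] else 0.0,
        )
        for row in ranked[:limit]
    ]
```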
While this document focuses on the JSON output, SiteOne Crawler also offers a simpler Text output format (`--output-text-file`). The Text output provides a human-readable summary suitable for quick review in a terminal or text editor.
See the Text Output Documentation for more details on the Text format.