SiteOne Crawler: JSON Output Documentation

This document describes the structure and content of the JSON output file generated by the SiteOne Crawler. This JSON file contains detailed information about the crawled website, including metadata about the crawl process, results for each visited URL, and various analysis tables.

1. Introduction

The JSON output provides a comprehensive dataset about the crawled website. Key information includes:

  • Crawl Metadata: Details about the crawler execution, such as version, execution time, command used, hostname, and the final user agent.
  • Visited URL Results: For each URL visited during the crawl:
    • URL address
    • HTTP status code
    • Elapsed time for the request (performance)
    • Size of the response body
    • Content type (HTML, CSS, JS, Image, etc.)
    • Caching information (cache flags, lifetime)
    • Additional analysis results stored in the extras field (though empty in the provided example, this often contains specific analyzer outputs).
  • Analysis Tables: Aggregated data and specific findings presented in structured tables:
    • Skipped URLs: Reasons why certain URLs were not crawled (e.g., external domain, disallowed by robots.txt, specific rules).
    • Redirects: List of URLs that resulted in redirects (3xx status codes).
    • 404 Errors: List of URLs that resulted in a 404 Not Found status.
    • SSL/TLS Info: Details about the website's SSL certificate (issuer, subject, validity dates, supported protocols).
    • Performance: Tables listing the fastest and slowest URLs encountered during the crawl.
    • SEO & Content:
      • SEO metadata (title, description, keywords, H1, indexing directives) for HTML pages.
      • OpenGraph metadata (og:title, og:description, og:image, etc.).
      • Heading structure analysis (correctness of H1-H6 hierarchy).
      • Analysis of non-unique titles and descriptions across pages.
    • Technical Details:
      • HTTP Headers: Summary of headers found, their occurrences, and unique values.
      • Caching Analysis: Breakdown of caching strategies by content type and domain.
      • DNS Information: DNS resolution details for the target domain.
      • Security Analysis: Evaluation of security-related HTTP headers.
    • Crawler Statistics: Performance metrics for the crawler itself and individual analyzers.

2. Potential Use Cases

The detailed data within the JSON output enables a wide variety of use cases:

  1. Comprehensive SEO Audits: Analyze titles, descriptions, heading structures, indexing status, and OpenGraph tags across the entire site.
  2. Performance Monitoring & Optimization: Identify the slowest pages and resources, analyze load times, and check caching headers.
  3. Broken Link Checking: Easily extract lists of all 404 errors and the pages where they were found.
  4. Redirect Chain Analysis: Identify and analyze redirect chains (although the example shows no redirects, the structure supports it).
  5. Security Header Audits: Verify the implementation of crucial security headers (CSP, HSTS, X-Frame-Options, etc.) across the site.
  6. Content Inventory & Analysis: Get a list of all crawled resources, their types, sizes, and status codes. Analyze content type distribution.
  7. Website Archiving/Cloning: While the crawler has a dedicated offline export, the JSON contains the list of all discovered resources, which could inform a custom archiving process.
  8. Competitive Analysis: Run the crawler on competitor sites (respecting their robots.txt) to gather insights into their structure, performance, and technology.
  9. CI/CD Integration: Integrate the crawler into deployment pipelines to automatically check for new errors (404s, performance regressions) after deployments.
  10. Technical Debt Assessment: Identify outdated practices, missing security headers, or performance issues that need addressing.

3. Detailed JSON Structure

The JSON output has the following main top-level keys:

3.1. extraColumnsFromAnalysis (Array)

An array of objects defining extra columns that might be added during specific analyses. These seem primarily intended for augmenting report outputs. Each object contains:

  • name (String): The display name of the column.
  • length (Integer): Suggested display length/width.
  • truncate (Boolean): Whether the content should be truncated if it exceeds the length.
  • customMethod, customPattern, customGroup: Fields likely used for custom data extraction logic (null in the example).

3.2. crawler (Object)

Contains metadata about the crawler execution:

  • name (String): Name of the crawler software.
  • version (String): Version of the crawler.
  • executedAt (String): Timestamp when the crawl was executed.
  • command (String): The command-line arguments used to run the crawl.
  • hostname (String): The hostname where the crawler was run.
  • finalUserAgent (String): The User-Agent string used for the HTTP requests.
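
As a minimal sketch of reading this metadata, the Python snippet below builds a one-line summary from the crawler object. The SAMPLE values and the describe_crawl helper are illustrative assumptions, not part of the crawler; real values come from the generated JSON file.

```python
import json

# Hypothetical sample mirroring the documented `crawler` keys; all values are made up.
SAMPLE = json.loads("""
{
  "crawler": {
    "name": "SiteOne Crawler",
    "version": "0.0.0",
    "executedAt": "2024-01-01 12:00:00",
    "command": "(command line as recorded by the crawler)",
    "hostname": "build-host",
    "finalUserAgent": "ExampleAgent/1.0"
  }
}
""")


def describe_crawl(report: dict) -> str:
    """Return a one-line summary built from the `crawler` object."""
    meta = report["crawler"]
    return f'{meta["name"]} {meta["version"]} on {meta["hostname"]} at {meta["executedAt"]}'


print(describe_crawl(SAMPLE))
```

With a real report, SAMPLE would be replaced by json.load() on the output file.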

3.3. results (Array)

An array of objects, where each object represents a single visited URL.

  • url (String): The absolute URL that was visited.
  • status (String): The HTTP status code returned (e.g., "200", "404").
  • elapsedTime (Float): Time taken to fetch the URL in seconds.
  • size (Integer): Size of the response body in bytes.
  • type (Integer): An enum representing the detected content type:
    • 1: HTML
    • 2: JavaScript
    • 3: CSS
    • 4: Image
    • 7: Document (e.g., robots.txt)
    • 8: JSON
    • Other types might exist.
  • cacheTypeFlags (Integer): Bitmask representing detected caching mechanisms (e.g., Cache-Control, ETag, Last-Modified). The exact bit assignments are internal to the crawler, so the following readings of the example values are guesses: 31 (binary 11111, five bits set) likely includes Cache-Control, ETag, and Last-Modified; 19 (binary 10011, three bits set) might indicate Cache-Control and Last-Modified; 32768 (a single bit, 2^15) might indicate that no caching headers were found.
  • cacheLifetime (Integer | null): Cache lifetime in seconds derived from Cache-Control: max-age or Expires header. null if no lifetime could be determined.
  • extras (Array | Object): Contains additional data from specific analyzers run on this URL. In the provided example, it's an empty array [], but for HTML pages, it often contains an object with Title, Description, etc.
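
The fields above can be explored with a short Python sketch. It counts results by status and content type, and splits cacheTypeFlags into its individual bits without assigning names to them, since the bit meanings are not documented here. The RESULTS rows and both helper names are hypothetical.

```python
from collections import Counter

# Content-type codes listed above; other codes may exist.
TYPE_NAMES = {1: "HTML", 2: "JavaScript", 3: "CSS", 4: "Image", 7: "Document", 8: "JSON"}


def set_bits(flags: int) -> list[int]:
    """Split a bitmask into the individual powers of two that are set."""
    return [1 << i for i in range(flags.bit_length()) if flags & (1 << i)]


def summarize(results: list[dict]) -> dict:
    """Count results by status code and by content-type name."""
    by_status = Counter(r["status"] for r in results)
    by_type = Counter(TYPE_NAMES.get(r["type"], f"type {r['type']}") for r in results)
    return {"status": dict(by_status), "type": dict(by_type)}


# Hypothetical rows using a subset of the documented keys.
RESULTS = [
    {"url": "https://example.com/", "status": "200", "type": 1, "cacheTypeFlags": 19},
    {"url": "https://example.com/app.js", "status": "200", "type": 2, "cacheTypeFlags": 31},
    {"url": "https://example.com/missing", "status": "404", "type": 1, "cacheTypeFlags": 32768},
]

print(summarize(RESULTS))
for row in RESULTS:
    print(row["url"], set_bits(row["cacheTypeFlags"]))
```

Note that status is compared as a string ("200"), matching the type documented for the results array.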

3.4. tables (Object)

An object where each key is a table identifier (e.g., skipped-summary, 404, seo) and the value is an object describing that table. Each table object contains:

  • aplCode (String): A unique code for the table.
  • title (String): A human-readable title for the table.
  • columns (Object): An object describing the columns of the table. Each key is a column identifier (e.g., reason, url, statusCode). The value is an object detailing the column:
    • aplCode (String): Unique code for the column.
    • name (String): Display name for the column header.
    • width (Integer): Suggested display width (-1 might mean auto).
    • formatter (Object | null): Defines how the data should be formatted (e.g., adding units like 'ms' or 'kB'). An empty object {} might indicate default or internal formatting.
    • renderer (Object | null): Defines how the data should be rendered (e.g., adding color or links). An empty object {} might indicate default or internal rendering.
    • truncateIfLonger (Boolean): Whether to truncate the value if it exceeds the width.
    • Other fields like formatterWillChangeValueLength, nonBreakingSpaces, escapeOutputHtml, getDataValueCallback, forcedDataType provide more hints for rendering.
  • rows (Array): An array of objects, where each object represents a row in the table. The keys in each row object correspond to the column identifiers defined in columns.
  • position (String): A hint about where this table should typically be positioned in a report (e.g., before-url-table, after-url-table).

Note: The specific content and structure within tables depend heavily on the analyzers enabled during the crawl. The example shows many tables related to performance, SEO, caching, headers, security, and skipped URLs.
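
Because every table shares the same columns/rows layout, one generic helper can export any of them. The sketch below, using only the Python standard library, writes a table as CSV using the column display names as the header row. The TABLE sample and the table_to_csv name are assumptions made for illustration.

```python
import csv
import io


def table_to_csv(table: dict) -> str:
    """Render one table object as CSV, keeping only the declared columns."""
    column_ids = list(table["columns"].keys())
    header = [table["columns"][col]["name"] for col in column_ids]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in table["rows"]:
        # Rows may carry extra keys beyond the declared columns; those are skipped.
        writer.writerow([row.get(col, "") for col in column_ids])
    return buffer.getvalue()


# Hypothetical table following the documented structure.
TABLE = {
    "aplCode": "non-unique-titles",
    "title": "TOP non-unique titles",
    "columns": {
        "count": {"aplCode": "count", "name": "Count", "width": 5},
        "title": {"aplCode": "title", "name": "Title", "width": -1},
    },
    "rows": [{"count": 2, "title": "Home", "extraKey": "ignored"}],
    "position": "after-url-table",
}

print(table_to_csv(TABLE))
```

Nested values (objects or arrays inside a row) would be written using their Python string form, so tables with such cells may need extra handling.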

4. JSON Schema (Draft)

This is a draft JSON schema based on the provided example. It might need refinement based on other possible outputs.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "SiteOne Crawler JSON Output",
  "description": "Schema for the JSON output file generated by SiteOne Crawler.",
  "type": "object",
  "properties": {
    "extraColumnsFromAnalysis": {
      "description": "Definitions for extra columns used in analyses.",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "length": { "type": "integer" },
          "truncate": { "type": "boolean" },
          "customMethod": { "type": ["string", "null"] },
          "customPattern": { "type": ["string", "null"] },
          "customGroup": { "type": ["string", "null"] }
        },
        "required": ["name", "length", "truncate"]
      }
    },
    "crawler": {
      "description": "Metadata about the crawler execution.",
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "version": { "type": "string" },
        "executedAt": { "type": "string", "format": "date-time" },
        "command": { "type": "string" },
        "hostname": { "type": "string" },
        "finalUserAgent": { "type": "string" }
      },
      "required": ["name", "version", "executedAt", "command", "hostname", "finalUserAgent"]
    },
    "results": {
      "description": "Array of results for each visited URL.",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "url": { "type": "string", "format": "uri" },
          "status": { "type": "string" },
          "elapsedTime": { "type": "number" },
          "size": { "type": "integer" },
          "type": { "type": "integer", "description": "Enum for content type (1:HTML, 2:JS, 3:CSS, 4:Image, ...)" },
          "cacheTypeFlags": { "type": "integer", "description": "Bitmask for caching mechanisms" },
          "cacheLifetime": { "type": ["integer", "null"], "description": "Cache lifetime in seconds or null" },
          "extras": {
            "type": ["array", "object"],
            "description": "Additional analysis data for this URL"
          }
        },
        "required": ["url", "status", "elapsedTime", "size", "type", "cacheTypeFlags", "cacheLifetime", "extras"]
      }
    },
    "tables": {
      "description": "Aggregated analysis results presented as tables.",
      "type": "object",
      "additionalProperties": {
        "type": "object",
        "properties": {
          "aplCode": { "type": "string" },
          "title": { "type": "string" },
          "columns": {
            "type": "object",
            "additionalProperties": {
              "type": "object",
              "properties": {
                "aplCode": { "type": "string" },
                "name": { "type": "string" },
                "width": { "type": "integer" },
                "formatter": { "type": ["object", "null"] },
                "renderer": { "type": ["object", "null"] },
                "truncateIfLonger": { "type": "boolean" }
              },
              "required": ["aplCode", "name", "width"]
            }
          },
          "rows": {
            "type": "array",
            "items": {
              "type": "object"
            }
          },
          "position": { "type": "string", "enum": ["before-url-table", "after-url-table"] }
        },
        "required": ["aplCode", "title", "columns", "rows", "position"]
      }
    }
  },
  "required": ["crawler", "results", "tables"]
}
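
Full validation would use a JSON Schema validator library. As a dependency-free sketch, the Python snippet below checks only the required keys that the draft schema lists for the top level, for each results item, and for each table. The find_missing_keys helper is hypothetical, and it does not check value types or the keys inside the crawler object.

```python
# Required keys copied from the draft schema above.
REQUIRED_TOP = ("crawler", "results", "tables")
REQUIRED_RESULT = (
    "url", "status", "elapsedTime", "size",
    "type", "cacheTypeFlags", "cacheLifetime", "extras",
)
REQUIRED_TABLE = ("aplCode", "title", "columns", "rows", "position")


def find_missing_keys(report: dict) -> list[str]:
    """Return human-readable messages for every missing required key."""
    problems = []
    for key in REQUIRED_TOP:
        if key not in report:
            problems.append(f"missing top-level key: {key}")
    for index, item in enumerate(report.get("results", [])):
        for key in REQUIRED_RESULT:
            if key not in item:
                problems.append(f"results[{index}] missing {key}")
    for table_id, table in report.get("tables", {}).items():
        for key in REQUIRED_TABLE:
            if key not in table:
                problems.append(f"tables[{table_id}] missing {key}")
    return problems


# Hypothetical report with one incomplete table.
REPORT = {
    "crawler": {},
    "results": [],
    "tables": {"seo": {"aplCode": "seo", "title": "SEO", "columns": {}, "rows": []}},
}
print(find_missing_keys(REPORT))
```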

5. Analysis Tables Description (tables key)

This section details the structure and columns of each table found under the tables key in the JSON output, based on the provided example.

5.1. skipped-summary (Skipped URLs Summary)

Provides a summary of skipped URLs grouped by domain and reason.

  • reason: (Integer) An enum indicating why URLs from this domain were skipped (e.g., 1 for external domain, 2 for disallowed by robots.txt). The specific codes depend on the crawler implementation.
  • domain: (String) The domain name whose URLs were skipped.
  • count: (Integer) The number of unique URLs skipped for this domain and reason.

5.2. skipped (Skipped URLs)

Lists individual URLs that were skipped during the crawl.

  • reason: (Integer) The reason code why the URL was skipped.
  • url: (String) The URL that was skipped.
  • sourceAttr: (Integer) An enum indicating the source type where the skipped URL was found (e.g., 10 might represent an <a> tag's href attribute).
  • sourceUqId: (String) A unique identifier (likely a hash) matching the uqId of the page where the skipped URL was found. It can be used to link back to the source page. Note that uqId does not appear in the base results array in the example; it might be added by specific configurations or be available only internally.

5.3. redirects (Redirected URLs)

Lists URLs that resulted in an HTTP redirect (3xx status code). (Note: This table is empty in the provided example).

  • statusCode: (Integer) The specific redirect status code (e.g., 301, 302).
  • url: (String) The original URL that redirected.
  • targetUrl: (String) The target URL to which the original URL redirected.
  • sourceUqId: (String) Identifier of the page where the redirected URL was found.

5.4. 404 (404 URLs)

Lists URLs that resulted in a "404 Not Found" status code.

  • statusCode: (Integer) The HTTP status code (typically 404).
  • url: (String) The URL that resulted in the 404 error.
  • sourceUqId: (String) Identifier of the page where the broken URL was found.
  • (Additional fields like uqId, sourceAttr, requestTime, size, contentType, etc., are also present in the row data, providing full context similar to the main results array).
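
For broken-link reports, the rows of this table can be grouped by the page that referenced them. The following Python sketch assumes a parsed report dictionary; the REPORT sample, its identifier values, and the broken_links_by_source helper are hypothetical.

```python
from collections import defaultdict


def broken_links_by_source(report: dict) -> dict:
    """Group 404 URLs by the sourceUqId of the page where they were found."""
    rows = report.get("tables", {}).get("404", {}).get("rows", [])
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.get("sourceUqId")].append(row["url"])
    return dict(grouped)


# Hypothetical data; the identifiers are made up.
REPORT = {
    "tables": {
        "404": {
            "rows": [
                {"statusCode": 404, "url": "https://example.com/a", "sourceUqId": "abc1"},
                {"statusCode": 404, "url": "https://example.com/b", "sourceUqId": "abc1"},
                {"statusCode": 404, "url": "https://example.com/c", "sourceUqId": "def2"},
            ]
        }
    }
}

print(broken_links_by_source(REPORT))
```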

5.5. certificate-info (SSL/TLS info)

Provides details about the SSL/TLS certificate of the crawled domain.

  • info: (String) The name of the certificate attribute (e.g., "Issuer", "Subject", "Valid from", "Valid to", "Supported protocols", "RAW certificate output", "RAW protocols output").
  • value: (String | Array) The value of the corresponding certificate attribute. Can be a string or an array (e.g., for supported protocols). Contains detailed raw outputs for certificate and protocol checks.

5.6. fastest-urls (TOP fastest URLs)

Lists the URLs with the lowest request times encountered during the crawl.

  • requestTime: (Float) The time taken to fetch the URL in seconds.
  • statusCode: (Integer) The HTTP status code of the URL.
  • url: (String) The URL itself.
  • (Additional fields like uqId, sourceUqId, size, contentType, extras, etc., are also present, providing full context).

5.7. slowest-urls (TOP slowest URLs)

Lists the URLs with the highest request times encountered during the crawl.

  • requestTime: (Float) The time taken to fetch the URL in seconds.
  • statusCode: (Integer) The HTTP status code of the URL.
  • url: (String) The URL itself.
  • (Additional fields like uqId, sourceUqId, size, contentType, extras, etc., are also present, providing full context).
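
A simple performance check can be built on this table, for example in a deployment pipeline. The Python sketch below lists URLs whose requestTime exceeds a threshold; the one-second threshold, the REPORT sample, and the urls_slower_than helper are assumptions for illustration.

```python
def urls_slower_than(report: dict, threshold_s: float) -> list[tuple[str, float]]:
    """Return (url, requestTime) pairs above the threshold, slowest first."""
    rows = report.get("tables", {}).get("slowest-urls", {}).get("rows", [])
    slow = [
        (row["url"], float(row["requestTime"]))
        for row in rows
        if float(row["requestTime"]) > threshold_s
    ]
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hypothetical rows; times are in seconds, as documented above.
REPORT = {
    "tables": {
        "slowest-urls": {
            "rows": [
                {"requestTime": 0.4, "statusCode": 200, "url": "https://example.com/a"},
                {"requestTime": 2.5, "statusCode": 200, "url": "https://example.com/b"},
                {"requestTime": 1.2, "statusCode": 200, "url": "https://example.com/c"},
            ]
        }
    }
}

print(urls_slower_than(REPORT, 1.0))
```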

5.8. seo (SEO metadata)

Provides SEO-related metadata extracted from HTML pages.

  • urlPathAndQuery: (String) The path and query string of the URL.
  • indexing: (Object/String) Information about whether the page is indexable/followable based on robots meta tags and X-Robots-Tag headers. Contains robotsIndex (1=index, 0=noindex), robotsFollow (1=follow, 0=nofollow), deniedByRobotsTxt (boolean).
  • title: (String | null) The content of the <title> tag.
  • h1: (String | null) The content of the first <h1> tag found.
  • description: (String | null) The content of the meta name="description" tag.
  • keywords: (String | null) The content of the meta name="keywords" tag.
  • (This table also includes OpenGraph, Twitter Card, and Heading Tree data within each row object).
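
Since title, h1, and description can be null, a basic SEO audit can list the pages where they are missing. This Python sketch treats both null and empty strings as missing; the ROWS sample and the missing_seo_fields helper are hypothetical.

```python
def missing_seo_fields(rows: list[dict], fields=("title", "h1", "description")) -> dict:
    """Map each urlPathAndQuery to the list of SEO fields that are null or empty."""
    report = {}
    for row in rows:
        missing = [field for field in fields if not row.get(field)]
        if missing:
            report[row["urlPathAndQuery"]] = missing
    return report


# Hypothetical rows from tables["seo"]["rows"].
ROWS = [
    {"urlPathAndQuery": "/", "title": "Home", "h1": "Welcome", "description": "Start page"},
    {"urlPathAndQuery": "/about", "title": "About", "h1": None, "description": ""},
]

print(missing_seo_fields(ROWS))
```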

5.9. open-graph (OpenGraph metadata)

Provides Open Graph and Twitter Card metadata extracted from HTML pages.

  • urlPathAndQuery: (String) The path and query string of the URL.
  • ogTitle: (String | null) Content of the og:title meta tag.
  • ogDescription: (String | null) Content of the og:description meta tag.
  • ogImage: (String | null) Content of the og:image meta tag.
  • twitterTitle: (String | null) Content of the twitter:title meta tag.
  • twitterDescription: (String | null) Content of the twitter:description meta tag.
  • twitterImage: (String | null) Content of the twitter:image meta tag.
  • (Other OG/Twitter fields like ogType, ogUrl, ogSiteName, twitterCard, twitterSite, twitterCreator are also included in the row data).

5.10. seo-headings (Heading structure)

Provides analysis of the heading (H1-H6) structure for each HTML page.

  • headings: (Object/String) A representation of the heading structure, often a tree or formatted string showing hierarchy and potential errors.
  • headingsCount: (Integer) Total number of headings found on the page.
  • headingsErrorsCount: (Integer) Number of structural errors found in the headings (e.g., skipping levels, incorrect starting level).
  • urlPathAndQuery: (String) The path and query string of the URL.
  • (The raw headingTreeItems array is also included in each row object).

5.11. headers (HTTP headers)

Summarizes the HTTP response headers encountered across all crawled URLs.

  • header: (String) The name of the HTTP header (case-insensitive).
  • occurrences: (Integer) The total number of times this header was found.
  • uniqueValues: (Object | Array) An object or array showing the distinct values found for this header and their counts. If the number of unique values is large, this might be truncated or just show counts.
  • valuesPreview: (String) A preview string showing some of the values encountered.
  • minValue: (Integer | String | null) The minimum value found (relevant for numerical or date headers like Content-Length or Last-Modified).
  • maxValue: (Integer | String | null) The maximum value found.

5.12. headers-values (HTTP header values)

Lists unique values for each HTTP header and their occurrence count.

  • header: (String) The name of the HTTP header.
  • occurrences: (Integer) The number of times this specific value occurred for this header.
  • value: (String) The specific unique value of the HTTP header.

5.13. caching-per-content-type (HTTP Caching by content type)

Analyzes caching effectiveness grouped by general content type (HTML, Image, JS, CSS, etc.).

  • contentType: (String) The general content type category.
  • cacheType: (String) Description of the caching mechanism detected (e.g., "Cache-Control + ETag + Last-Modified", "No cache headers").
  • count: (Integer) Number of URLs matching this content type and cache type.
  • avgLifetime: (Float | null) Average cache lifetime in seconds for URLs in this group (if determinable).
  • minLifetime: (Integer | null) Minimum cache lifetime in seconds.
  • maxLifetime: (Integer | null) Maximum cache lifetime in seconds.

5.14. caching-per-domain (HTTP Caching by domain)

Analyzes caching effectiveness grouped by domain.

  • domain: (String) The domain name.
  • cacheType: (String) Description of the caching mechanism detected.
  • count: (Integer) Number of URLs from this domain matching this cache type.
  • avgLifetime: (Float | null) Average cache lifetime in seconds.
  • minLifetime: (Integer | null) Minimum cache lifetime in seconds.
  • maxLifetime: (Integer | null) Maximum cache lifetime in seconds.

5.15. caching-per-domain-and-content-type (HTTP Caching by domain and content type)

Analyzes caching effectiveness grouped by both domain and general content type.

  • domain: (String) The domain name.
  • contentType: (String) The general content type category.
  • cacheType: (String) Description of the caching mechanism detected.
  • count: (Integer) Number of URLs matching this domain, content type, and cache type.
  • avgLifetime: (Float | null) Average cache lifetime in seconds.
  • minLifetime: (Integer | null) Minimum cache lifetime in seconds.
  • maxLifetime: (Integer | null) Maximum cache lifetime in seconds.

5.16. non-unique-titles (TOP non-unique titles)

Lists page titles that appear on more than one page.

  • count: (Integer) The number of pages sharing this title.
  • title: (String) The non-unique page title.

5.17. non-unique-descriptions (TOP non-unique descriptions)

Lists meta descriptions that appear on more than one page.

  • count: (Integer) The number of pages sharing this description.
  • description: (String) The non-unique meta description content.

5.18. best-practices (Best practices)

Summarizes the results of various best practice checks performed by analyzers.

  • analysisName: (String) The name of the specific best practice check (e.g., "Large inline SVGs", "Heading structure", "Brotli support").
  • ok: (Integer) Count of URLs passing this check.
  • notice: (Integer) Count of URLs with a notice-level finding for this check.
  • warning: (Integer) Count of URLs with a warning-level finding.
  • critical: (Integer) Count of URLs with a critical-level finding.
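
The severity counters lend themselves to a pass/fail gate. The Python sketch below sums findings at or above a chosen level for each check. The ROWS sample, its counts, and the findings_at_or_above helper are hypothetical.

```python
LEVELS = ("notice", "warning", "critical")


def findings_at_or_above(rows: list[dict], level: str = "warning") -> dict:
    """Return {analysisName: count} for checks with findings at or above `level`."""
    relevant = LEVELS[LEVELS.index(level):]
    failing = {}
    for row in rows:
        total = sum(int(row.get(name) or 0) for name in relevant)
        if total:
            failing[row["analysisName"]] = total
    return failing


# Hypothetical rows from tables["best-practices"]["rows"].
ROWS = [
    {"analysisName": "Heading structure", "ok": 10, "notice": 0, "warning": 2, "critical": 1},
    {"analysisName": "Brotli support", "ok": 12, "notice": 1, "warning": 0, "critical": 0},
]

print(findings_at_or_above(ROWS, "warning"))
```

A pipeline step could exit with a non-zero status whenever the returned dictionary is non-empty.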

5.19. accessibility (Accessibility)

Summarizes the results of accessibility checks.

  • analysisName: (String) The name of the specific accessibility check (e.g., "Missing image alt attributes", "Missing html lang attribute").
  • ok: (Integer) Count of elements/pages passing this check.
  • notice: (Integer) Count of notice-level findings.
  • warning: (Integer) Count of warning-level findings.
  • critical: (Integer) Count of critical-level findings.

5.20. source-domains (Source domains)

Provides statistics about the domains from which resources were loaded.

  • domain: (String) The domain name.
  • totals: (String) A summary string showing total count, size, and time for resources from this domain (e.g., "67/30MB/6.2s").
  • HTML, Image, JS, CSS, Document, JSON, etc.: (String) Summary strings (count/size/time) broken down by content type for resources loaded from this domain.
  • totalCount: (Integer) Total number of resources loaded from this domain.
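
The summary strings follow a count/size/time pattern, as in the "67/30MB/6.2s" example. A minimal Python sketch for splitting them is shown below; it keeps size and time as raw strings because the full set of unit suffixes is not documented here. The split_summary name is hypothetical.

```python
def split_summary(summary: str) -> tuple[int, str, str]:
    """Split a 'count/size/time' string into an integer count and two raw strings."""
    # Raises ValueError if the string does not have exactly three parts.
    count, size, time_spent = summary.split("/")
    return int(count), size, time_spent


print(split_summary("67/30MB/6.2s"))
```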

5.21. content-types (Content types)

Summarizes statistics grouped by general content type.

  • contentType: (String) The general content type category (e.g., "HTML", "Image").
  • count: (Integer) Total number of URLs of this content type.
  • totalSize: (Integer) Total size in bytes for this content type.
  • totalTime: (Float) Total time spent fetching resources of this content type.
  • avgTime: (Float) Average time spent fetching a resource of this content type.
  • status20x: (Integer) Count of URLs with a 2xx status code.
  • status40x: (Integer) Count of URLs with a 4xx status code.
  • (Counts for other status code ranges like 3xx, 5xx might also appear depending on results).
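
One use of these rows is to see which content type dominates the transferred bytes. The Python sketch below computes each type's share of the total size. The ROWS sample and the size_share helper are assumptions for illustration.

```python
def size_share(rows: list[dict]) -> dict:
    """Return each content type's share of total bytes, as a percentage."""
    total = sum(row["totalSize"] for row in rows)
    if total == 0:
        return {}
    return {row["contentType"]: round(100 * row["totalSize"] / total, 1) for row in rows}


# Hypothetical rows from tables["content-types"]["rows"].
ROWS = [
    {"contentType": "HTML", "count": 5, "totalSize": 1000},
    {"contentType": "Image", "count": 3, "totalSize": 3000},
]

print(size_share(ROWS))
```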

5.22. content-types-raw (Content types (MIME types))

Summarizes statistics grouped by the specific MIME type reported in the Content-Type HTTP header.

  • contentType: (String) The raw MIME type string (e.g., "text/html", "image/svg+xml", "text/html; charset=utf-8").
  • count: (Integer) Total number of URLs with this MIME type.
  • totalSize: (Integer) Total size in bytes.
  • totalTime: (Float) Total time spent fetching.
  • avgTime: (Float) Average time spent fetching.
  • status20x: (Integer) Count of URLs with a 2xx status code.
  • status40x: (Integer) Count of URLs with a 4xx status code.

5.23. dns (DNS info)

Shows the DNS resolution information for the crawled domain(s).

  • info: (String) A line of text representing part of the DNS resolution (e.g., the domain name, an IP address, the DNS server used). Presented as a simple text tree.

5.24. security (Security)

Summarizes findings related to security HTTP headers.

  • header: (String) The name of the security header being analyzed (e.g., "Strict-Transport-Security", "X-Frame-Options").
  • ok: (Integer) Count of URLs where the header was configured correctly according to the analyzer's rules.
  • notice: (Integer) Count of URLs with a notice-level finding related to this header.
  • warning: (Integer) Count of URLs with a warning-level finding.
  • critical: (Integer) Count of URLs with a critical-level finding.
  • recommendation: (Object | Array) Contains textual recommendations based on the findings for this header.

5.25. analysis-stats (Analysis stats)

Provides performance metrics for individual analyzer methods.

  • classAndMethod: (String) The class and method name of the analyzer function.
  • execTime: (Float) Total execution time in seconds spent in this method across all relevant URLs/data points.
  • execCount: (Integer) The number of times this method was executed.
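
Dividing execTime by execCount gives the average cost per call, which helps spot expensive analyzers. The following Python sketch ranks methods by total time and reports that average. The method names in ROWS and the slowest_analyzers helper are hypothetical.

```python
def slowest_analyzers(rows: list[dict], top: int = 3) -> list[tuple[str, float]]:
    """Return (classAndMethod, average seconds per call) for the costliest methods."""
    ranked = sorted(rows, key=lambda row: row["execTime"], reverse=True)[:top]
    return [
        (row["classAndMethod"], row["execTime"] / row["execCount"] if row["execCount"] else 0.0)
        for row in ranked
    ]


# Hypothetical rows from tables["analysis-stats"]["rows"]; names are made up.
ROWS = [
    {"classAndMethod": "ExampleAnalyzer::checkA", "execTime": 1.0, "execCount": 4},
    {"classAndMethod": "ExampleAnalyzer::checkB", "execTime": 3.0, "execCount": 2},
    {"classAndMethod": "ExampleAnalyzer::checkC", "execTime": 0.0, "execCount": 0},
]

print(slowest_analyzers(ROWS, top=2))
```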

6. Note on Text Output

While this document focuses on the JSON output, SiteOne Crawler also offers a simpler Text output format (--output-text-file). The Text output provides a human-readable summary suitable for quick review in a terminal or text editor.

See the Text Output Documentation for more details on the Text format.