[Feature] Optimize Elasticsearch Ingestion: Replace Daily Indices with an ILM Rollover Alias #13384

chj9 · 2025-07-25T01:58:47Z

chj9
Jul 25, 2025

Search before asking

I have searched the issues and found no similar feature request.

Description

1. Background

The current implementation writes data to Elasticsearch by creating a new index for each day (e.g., skywalking_segment-20250724). While this approach is straightforward for low data volumes, it presents significant challenges as the amount of data grows. When daily data volume is high, this strategy leads to massive single-day indices (potentially hundreds of gigabytes), causing severe issues:

Degraded Query Performance: Querying a massive index consumes substantial memory and CPU, resulting in slow queries or even timeouts. This negatively impacts user experience and data analysis efficiency.
Unbalanced Shards: Shards for high-volume days become excessively large, while shards for low-volume days remain small, leading to inefficient resource allocation.

2. Proposed Solution

We propose migrating from the daily index pattern to a strategy that leverages Elasticsearch's built-in Index Lifecycle Management (ILM) combined with a Rollover Alias.

The core concept of this strategy is:

Write to a single, fixed alias (e.g., skywalking_segment). Both writes and queries will target this alias.
Automate index management with an ILM policy. When the current write index meets a defined condition (e.g., a primary shard reaches 15GB or the index is 2 days old), ILM automatically creates a new index and switches the alias's write index (is_write_index: true) to it.
Automate data retention. The ILM policy also handles the lifecycle of old data, such as deleting an index 7 days after it rolls over, without any external intervention.
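
To make the rollover trigger concrete, here is a minimal Python sketch of how the two example conditions combine. It is a simplified model, not Elasticsearch's implementation: the function name should_roll_over and its parameters are hypothetical, and the defaults mirror the 15GB and 2d examples above. The conditions are ORed, so whichever is met first triggers the rollover.

```python
def should_roll_over(primary_shard_sizes_gb, index_age_days,
                     max_primary_shard_size_gb=15.0, max_age_days=2.0):
    """Return True if either example rollover condition is met (hypothetical model)."""
    # The size condition looks at the largest primary shard, not the whole index.
    largest_primary_gb = max(primary_shard_sizes_gb, default=0.0)
    size_reached = largest_primary_gb >= max_primary_shard_size_gb
    age_reached = index_age_days >= max_age_days
    return size_reached or age_reached


print(should_roll_over([3.2, 15.1, 2.9], index_age_days=0.5))  # True (size)
print(should_roll_over([3.0, 4.0], index_age_days=2.0))        # True (age)
print(should_roll_over([3.0, 4.0], index_age_days=1.0))        # False
```

In a real cluster, ILM checks these conditions periodically (indices.lifecycle.poll_interval, 10 minutes by default), so an index can overshoot a threshold slightly before it actually rolls over.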

3. Implementation Steps

The complete implementation involves the following four key steps:

Step 1: Create an ILM Policy

Define a policy that specifies the conditions for the rollover and delete actions.

PUT _ilm/policy/skywalking_segment_ilm_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "15gb",
            "max_age": "2d"
          }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

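Explanation: A rollover is triggered when any primary shard reaches 15GB or the index is 2 days old, whichever comes first. Each index is then deleted automatically 7 days after it rolls over (for rolled-over indices, ILM measures min_age from the rollover time, not from index creation).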
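
As a rough worked example of the retention timeline this policy implies, the sketch below computes the latest deletion time for an index that only rolls over on max_age. It assumes the delete phase's min_age is counted from the rollover time and ignores ILM polling delay; the helper name latest_deletion_time is hypothetical.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=2)         # rollover max_age from the policy above
DELETE_MIN_AGE = timedelta(days=7)  # delete phase min_age from the policy above


def latest_deletion_time(created_at):
    """Deletion time when the index rolls over on max_age only.

    Assumes min_age is measured from the rollover time.
    """
    rolled_over_at = created_at + MAX_AGE
    return rolled_over_at + DELETE_MIN_AGE


created = datetime(2025, 7, 24)
deleted = latest_deletion_time(created)
print(deleted)                   # 2025-08-02 00:00:00
print((deleted - created).days)  # 9
```

Under these assumptions, the oldest documents in an index can be up to 9 days old when it is deleted, which matters if "7 days" is meant as a strict retention limit.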

Step 2: Create an Index Template

Create a template to automatically apply the ILM policy and settings to all new indices matching the skywalking_segment-* pattern.

PUT _index_template/skywalking_segment_template
{
  "index_patterns": [
    "skywalking_segment-*"
  ],
  "template": {
    "settings": {
      "index": {
        "refresh_interval": "5s",
        "number_of_shards": "5",
        "number_of_replicas": "0",
        "lifecycle": {
          "name": "skywalking_segment_ilm_policy",
          "rollover_alias": "skywalking_segment" 
        }
      }
    },
    "mappings": {
      "properties": {
        "message": {
          "type": "text"
        }
      }
    }
  }
}

Explanation: Any new index with a name starting with skywalking_segment- will be associated with the skywalking_segment_ilm_policy and use skywalking_segment as its rollover alias.
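
The pattern matching can be sketched in Python as follows. This is only an approximation: it uses the standard library's fnmatch to stand in for Elasticsearch's wildcard matching, which behaves the same for a simple trailing * like this one. The helper name matches_template and the skywalking_metrics-000001 name are hypothetical.

```python
from fnmatch import fnmatchcase

TEMPLATE_PATTERN = "skywalking_segment-*"


def matches_template(index_name, pattern=TEMPLATE_PATTERN):
    """Approximate the index template's wildcard match with a shell-style glob."""
    return fnmatchcase(index_name, pattern)


for name in [
    "skywalking_segment-000001",   # rollover-style name
    "skywalking_segment-20250724", # old daily-style name
    "skywalking_segment",          # the alias itself
    "skywalking_metrics-000001",   # unrelated (hypothetical) index
]:
    print(name, matches_template(name))
```

Note that a daily-style name such as skywalking_segment-20250724 also matches the pattern, so any daily index created after the template exists would pick up the same settings. That may be worth accounting for during migration.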

Step 3: Create the Bootstrap Index

Manually create the very first index and assign the alias to it, explicitly marking it as the write index.

PUT skywalking_segment-000001
{
  "aliases": {
    "skywalking_segment": {
      "is_write_index": true
    }
  }
}

Explanation: This is the "seed" index to start the process. All subsequent indices (skywalking_segment-000002, skywalking_segment-000003, etc.) will be created and managed automatically by ILM.
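
The naming sequence can be illustrated with a small Python sketch. It models the convention that rollover increments the trailing number and zero-pads it to six digits; the function name next_rollover_index is hypothetical, and this is not code that Elasticsearch or SkyWalking runs.

```python
import re


def next_rollover_index(name):
    """Return the name of the index that follows `name` in a rollover series."""
    match = re.fullmatch(r"(.+)-(\d+)", name)
    if match is None:
        raise ValueError(f"index name must end in '-<number>': {name!r}")
    prefix, counter = match.groups()
    # Increment the numeric suffix and zero-pad it to six digits.
    return f"{prefix}-{int(counter) + 1:06d}"


print(next_rollover_index("skywalking_segment-000001"))  # skywalking_segment-000002
print(next_rollover_index("skywalking_segment-000009"))  # skywalking_segment-000010
```

This is also why the bootstrap index name has to end in a number: a name without a numeric suffix gives the rollover nothing to increment.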

Step 4: Modify Application Code

This is the most critical change. All logic in the application code that writes to and queries from Elasticsearch must be updated:

Write Operations: The target destination should be changed from a dynamic, date-based index name (e.g., skywalking_segment-20250724) to the fixed alias skywalking_segment.
Query Operations: The query target should also be unified to the alias skywalking_segment. Since the alias points to all relevant active indices (e.g., skywalking_segment-000001, skywalking_segment-000002), querying the alias will search across all necessary data.
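
A minimal Python sketch of the difference in target resolution, under stated assumptions: the helper names daily_indices and alias_target are hypothetical and do not reflect how SkyWalking actually resolves index names. With daily indices, a time range maps to an explicit list of index names; with the alias, every query is sent to the same name and Elasticsearch decides which backing indices to search.

```python
from datetime import date, timedelta


def daily_indices(prefix, start, end):
    """List the date-based index names covering [start, end], inclusive."""
    days = (end - start).days
    return [f"{prefix}-{(start + timedelta(days=i)):%Y%m%d}" for i in range(days + 1)]


def alias_target(alias):
    """With a rollover alias, the target is always the alias itself."""
    return [alias]


print(daily_indices("skywalking_segment", date(2025, 7, 23), date(2025, 7, 24)))
# ['skywalking_segment-20250723', 'skywalking_segment-20250724']
print(alias_target("skywalking_segment"))
# ['skywalking_segment']
```

The trade-off is that the daily pattern lets the application narrow a query to only the indices that cover the requested time range, while the alias approach hands that selection over to Elasticsearch.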

4. Advantages

Adopting this solution will yield significant benefits:

Automated Lifecycle Management: Eliminates the need for complex index creation/deletion logic in the application code, handing over responsibility to Elasticsearch and reducing maintenance costs.
Balanced Shard Sizes: By controlling shard size with max_primary_shard_size, we ensure that every shard remains within a healthy and efficient size range, preventing giant shards.
Improved Query Performance: Smaller, well-balanced shards lead to faster query speeds and more stable performance.
Simplified Application Logic: The application code is decoupled from physical index names and timing concerns; it only needs to interact with a fixed alias.
Seamless Index Rollover: The rollover action is atomic, allowing write traffic to transition smoothly from an old index to a new one with no data loss or service interruption.
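
As a rough back-of-the-envelope check on the shard-size claim, the sketch below estimates the upper bound on primary storage per index from the example settings (5 primary shards, 15GB per primary shard). It assumes documents are spread evenly across shards and ignores any overshoot between ILM checks; the function name is hypothetical.

```python
def max_primary_store_gb(number_of_shards, max_primary_shard_size_gb):
    """Approximate upper bound on primary storage for one rollover index."""
    return number_of_shards * max_primary_shard_size_gb


print(max_primary_store_gb(5, 15))  # 75
```

So with the example template, each index would hold at most about 75GB of primary data before rolling over, regardless of how much data arrives on a given day.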

5. Potential Impact

Data Migration: A strategy will be needed to manage existing daily indices. They can be added to a separate ILM policy that only contains a delete phase, or they can be removed manually after they expire.
Configuration Changes: The project's configuration files will need to be updated, replacing the old date-based index names (e.g., skywalking_segment-20250724) with the new write alias (e.g., skywalking_segment).
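
For the existing daily indices, a delete-only policy body could look like the following sketch, built as a Python dict and printed as JSON. The function name delete_only_policy and the 7d default are assumptions for illustration. Such a policy would then need to be attached to the old indices through their index.lifecycle.name setting.

```python
import json


def delete_only_policy(min_age="7d"):
    """Build an ILM policy body that contains only a delete phase."""
    return {
        "policy": {
            "phases": {
                "delete": {
                    # These indices never roll over, so the age here is
                    # counted from index creation.
                    "min_age": min_age,
                    "actions": {"delete": {}},
                }
            }
        }
    }


print(json.dumps(delete_only_policy(), indent=2))
```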

Conclusion:
This optimization is a critical step to ensure the system remains performant and highly available as data volumes continue to scale. We strongly recommend that the core development team evaluate and adopt this proposal.

Use case

Data Storage, Logging Module, Elasticsearch Integration

Related issues

Optimize storage

Are you willing to submit a pull request to implement this on your own?

Yes I am willing to submit a pull request on my own!

Code of Conduct

I agree to follow this project's Code of Conduct

wu-sheng · 2025-07-25T05:02:24Z

wu-sheng
Jul 25, 2025
Collaborator

This is not the first time this has been raised. However, querying through an alias is actually very slow compared with querying specific daily indices for a specific time range. That query performance impact leads to functional issues.
We have had several discussions with users, and there is no perfect way to fix that.

2 replies

wu-sheng Jul 25, 2025
Collaborator

Also, size-based rolling could make data updates fail, because we don't only keep minute metrics; we also need to update hour and day metrics, as you will see if you look at the data more closely. ILM could move that data into a read-only index, or into an index we can't determine in advance (an alias can't be used for updates if it matches multiple indices).
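
A simplified in-memory model of this update problem, as a sketch only: the AliasModel class and the demo_metrics names are hypothetical, and it models just one behavior, namely that writes through an alias go to the current write index. A day-level document written before a rollover stays in the old index, so a later update through the alias cannot find it.

```python
class AliasModel:
    """Toy model of a rollover alias: writes always go to the write index."""

    def __init__(self, alias):
        self.alias = alias
        self.indices = {}        # index name -> {doc_id: document}
        self.write_index = None  # name of the current write index

    def roll_over(self, new_index):
        """Create a new backing index and make it the write index."""
        self.indices[new_index] = {}
        self.write_index = new_index

    def index(self, doc_id, doc):
        """Store a document in the current write index."""
        self.indices[self.write_index][doc_id] = dict(doc)

    def update(self, doc_id, fields):
        """Update a document by ID; only the write index is consulted."""
        target = self.indices[self.write_index]
        if doc_id not in target:
            raise KeyError(f"{doc_id} not found in write index {self.write_index}")
        target[doc_id].update(fields)


model = AliasModel("demo_metrics")
model.roll_over("demo_metrics-000001")
model.index("cpu_day_20250724", {"value": 10})
model.update("cpu_day_20250724", {"value": 12})   # works: same index

model.roll_over("demo_metrics-000002")            # size-based rollover mid-day
try:
    model.update("cpu_day_20250724", {"value": 15})
except KeyError as err:
    print("update failed:", err)
```

With date-based indices the application can compute which index holds a given hour or day document. With size-based rollover the rollover point is not predictable, so the application would first have to find which backing index holds the document.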

wu-sheng Jul 25, 2025
Collaborator

This ILM policy has been on the table before, and it isn't suitable for all cases.

wu-sheng · 2025-07-25T05:09:31Z

wu-sheng
Jul 25, 2025
Collaborator

Compared to the Elasticsearch index challenges, the cost of Elasticsearch is actually becoming the unacceptable issue, more so than unbalanced shards and the other issues.
That is why we created BanyanDB and are moving forward with it. APM actually keeps hybrid data in its store, which traditional databases can't support perfectly. In BanyanDB, we have size-based rolling, a TSDB for metrics, a log system, and an upcoming dedicated trace module.

0 replies
