
File source after deletion and writing again not working properly #15885

Closed
skapix opened this issue Jan 10, 2023 · 6 comments
Labels: source: file (Anything `file` source related), type: bug (A code related bug)

Comments

@skapix

skapix commented Jan 10, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

After a file is removed and then immediately recreated and written to again, Vector may keep a handle to the deleted file open and not see any new input for this file.

Steps to reproduce the bug, using the configuration attached below:

  1. Go to the working directory:
    cd /tmp
  2. Simulate log rotation in bash by deleting and recreating the file:
    for i in {1..101}; do echo "abc: $i" >> 1.txt; done; rm 1.txt; for i in {1..10100}; do echo "cde: $i" >> 1.txt; done;
  3. Check whether the file is still being held by Vector:
lsof -nP +L1
COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF NLINK     NODE NAME
vector    413099      root    9r   REG  259,2   100795     0  8126906 /tmp/1.txt (deleted)

This is a case where Vector did not release the file.

If you see no held files, try again from step 2. It usually takes 2-3 attempts before Vector fails to release the file.
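For convenience, a minimal sketch that automates the retry until lsof reports a deleted file still held by Vector (this assumes Vector is already running with the configuration below and watching /tmp/*.txt):

#!/usr/bin/env bash
# Repeat the rotation simulation until Vector is caught holding the deleted file.
cd /tmp
while true; do
  for i in {1..101}; do echo "abc: $i" >> 1.txt; done
  rm 1.txt
  for i in {1..10100}; do echo "cde: $i" >> 1.txt; done
  sleep 1  # give Vector a moment to react
  # +L1 restricts lsof to open files with link count < 1, i.e. deleted but still open
  if lsof -nP +L1 | grep -q '/tmp/1.txt (deleted)'; then
    echo "Vector is still holding the deleted file."
    break
  fi
done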

After this happens, Vector does not see any input from this file:

$ tail -n 1 vector-1234.log 
{"file":"/tmp/1.txt","host":"host","message":"abc: 101","source_type":"file","timestamp":"2023-01-10T10:52:36.667782990Z"}

Note that this is the last line ingested before the file was dropped; no data from the new file appears in the output log.

Configuration

sources:
  post_dummy_source:
    type: file
    include:
      - /tmp/*.txt
sinks:
  my_sink_id:
    type: file
    inputs:
      - post_*
    compression: none
    encoding:
      codec: json
    path: /tmp/vector-1234.log
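A sketch of running Vector against this configuration (the file path /tmp/vector.yaml is only an assumption for illustration):

vector --config /tmp/vector.yaml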


Version

vector 0.26.0 (x86_64-unknown-linux-gnu c6b5bc2 2022-12-05)

Debug Output

No response

Example Data

No response

Additional Context

The issue happens both for a local Vector installation and for Vector running in Kubernetes.

References

No response
@skapix added the type: bug label on Jan 10, 2023
@spencergilbert added the source: file label on Jan 10, 2023
@hhromic
Contributor

hhromic commented Jan 10, 2023

What timing. This morning we experienced a "too many open files" OS error from Vector on a file-based pipeline that is also supposed to delete old files. We discovered that Vector indeed did not delete all the files it should have and exhausted the OS open-file limit. The end effect was that lots of old files were left on the server and the disk started to fill up.

@jszwedko changed the title from "File source after delition and writing again not working properly" to "File source after deletion and writing again not working properly" on Jan 10, 2023
@StephenWakely
Contributor

StephenWakely commented Mar 7, 2023

I can only reproduce this issue if I run the same command (`for i in {1..101}; do echo "abc: $i" >> 1.txt; done; rm 1.txt; for i in {1..10100}; do echo "cde: $i" >> 1.txt; done;`) multiple times.

If I run this command:

for i in {1..101}; do  echo "abc: $(date +%s.%N)" >> 1.txt; done; rm 1.txt; for i in {1..101}; do  echo "cde: $(date +%s.%N)" >> 1.txt; done;

this problem does not seem to occur, since no two files will ever have the same fingerprint.

It's likely the problem is caused because:

  • Vector is tailing a file whose first line is `cde: 1` (left over from a previous run of the command).
  • We append `abc: ..` lines onto this file. The file is then deleted.
  • We create a new file whose first line is also `cde: 1`.

Since the fingerprint hasn't changed, Vector doesn't realise this is a new file, so it keeps the file handle to the deleted file open.

I'm not sure there is a solution here. Real-world cases should not be creating files with the same fingerprint.
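If colliding fingerprints are unavoidable, one option worth trying is switching the file source's fingerprinting from the default checksum strategy to device-and-inode identity. A minimal sketch of the configuration (note that inode-based fingerprinting has its own trade-offs, e.g. with rotation schemes that rename files):

sources:
  post_dummy_source:
    type: file
    include:
      - /tmp/*.txt
    fingerprint:
      # identify files by device and inode instead of content checksum
      strategy: device_and_inode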

@StephenWakely
Contributor

> What timing. This morning we experienced a "too many open files" OS error from Vector on a file-based pipeline that is also supposed to delete old files. We discovered that Vector indeed did not delete all the files it should have and exhausted the OS open-file limit. The end effect was that lots of old files were left on the server and the disk started to fill up.

@hhromic Is it possible you are creating multiple files with the same fingerprint?

@hhromic
Contributor

hhromic commented Mar 7, 2023

Hi @StephenWakely, thanks for investigating this!
Your findings are quite interesting.

> Is it possible you are creating multiple files with the same fingerprint?

We are using the following configuration:

    fingerprint:
      strategy: checksum
      lines: 7
      ignored_header_bytes: 5

However, we are processing gzipped files, and as I reported in #13193, the checksum fingerprinter has issues with compressed data. Given your findings, I'm now thinking that this issue (for us) might be related to the one I reported for gzipped data.

To answer your question: is there a way for me to verify that Vector is seeing the same fingerprints?
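A rough approximation outside Vector might be to hash the bytes that the configuration above selects (skip 5 header bytes, take 7 lines). This is not Vector's exact fingerprint algorithm, so identical output only hints at a collision:

#!/usr/bin/env bash
# Approximate the checksum fingerprint input: skip ignored_header_bytes (5),
# take `lines` (7) lines, and hash them. NOT Vector's exact algorithm.
# For gzipped files the compressed bytes may contain no newlines at all,
# in which case `head` returns (up to) the whole file.
for f in "$@"; do
  hash=$(tail -c +6 "$f" | head -n 7 | sha256sum | cut -d' ' -f1)
  printf '%s  %s\n' "$hash" "$f"
done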

@hhromic
Contributor

hhromic commented Mar 7, 2023

I just realised that the entity deleting the files in our case is Vector itself (configured to do so in the file source).
In the issue reported here and the reproduction case being used, the entity deleting files is external to Vector.

Therefore, it seems our issue is actually different from the one in this report :(.

@skapix
Author

skapix commented Mar 9, 2023

This issue does indeed happen due to the fingerprint.

We have a similar issue in our production environment and will check whether it is caused by checksum equality.
If not, I will open a new ticket that reproduces our production environment issue.

Thank you @StephenWakely!

@skapix closed this as completed on Mar 9, 2023