Commit d085b52

committed
translated original implementation notes letter from Russian to English
1 parent e4473a7 commit d085b52

4 files changed

+110
-7
lines changed

README.md

Lines changed: 7 additions & 6 deletions
@@ -1,4 +1,4 @@
1-
# Asynchronous Log File Reader [![Code license](https://img.shields.io/github/license/work-examples/async-log-reader)](LICENSE)
1+
# Asynchronous Log File Reader [![Code license](https://img.shields.io/github/license/work-examples/async-log-reader)](LICENSE)
22

33
**Example project:** Log file reader and filter (`grep` analog).
44
Application reads specified file and writes matching lines to `stdout`.
@@ -16,16 +16,17 @@ C++ exceptions are not used (forbidden).
1616
| **master** | [![CI status](https://github.com/work-examples/async-log-reader/actions/workflows/build.yml/badge.svg?branch=master)](https://github.com/work-examples/async-log-reader/actions/workflows/build.yml?query=branch%3Amaster) | [![CodeQL Code Analysis Status](https://github.com/work-examples/async-log-reader/actions/workflows/codeql-analysis.yml/badge.svg?branch=master)](https://github.com/work-examples/async-log-reader/actions/workflows/codeql-analysis.yml?query=branch%3Amaster) |
1717
| **develop** | [![CI status](https://github.com/work-examples/async-log-reader/actions/workflows/build.yml/badge.svg?branch=develop)](https://github.com/work-examples/async-log-reader/actions/workflows/build.yml?query=branch%3Adevelop) | \[not applicable\] |
1818

19-
## Test Task Description
19+
## C++ Programmer's Test Task Description
2020

2121
Detailed task description is provided in a separate document:
2222

23-
- [task-description.md](docs/task-description.md)
23+
- [C++ Programmer's Test Task Description](docs/task-description.md) (`docs/task-description.md`)
2424

25-
## Task Implementation Remarks
25+
## Test Task Implementation Notes
2626

27-
Original implementation remarks letter in Russian is provided here:
27+
Original implementation notes letter is provided here in two languages:
2828

29-
- [solution-notes-letter.ru.md](docs/solution-notes-letter.ru.md)
29+
- [EN: Test Task Implementation Notes](docs/implementation-notes-letter.md) (in English, `docs/implementation-notes-letter.md`)
30+
- [RU: Заметки по решению](docs/implementation-notes-letter.ru.md) (in Russian, `docs/implementation-notes-letter.ru.md`)
3031

3132
---

docs/implementation-notes-letter.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
1+
# Test Task Implementation Notes
2+
3+
## Solution Progress
4+
5+
The task took much longer than I originally estimated.
6+
7+
It was quite humbling when the first version turned out to be 5-6 times slower than `grep`,
8+
even though it was heavily optimized for speed, with statically allocated arrays and no memory reallocations.
9+
I had expected the result to already beat comparable tools. Moreover, I had a specialized algorithm,
10+
while `grep` has a different syntax and a general-purpose algorithm inside, so I assumed `grep` would be slower by definition.
11+
12+
Originally the line-matching algorithm used dynamic programming with `O(P*N)` memory.
13+
Then I replaced it with another one that was faster and used constant memory.
14+
And... it became only 2 times slower than `grep`. It was almost a success `:)`
15+
At the same time, according to the profiler, more than 80% of the time was spent in the line-matching algorithm itself.
16+
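To illustrate what a constant-memory matcher can look like, here is a minimal sketch of the classic greedy approach that remembers only the most recent `*` and backtracks to it on a mismatch. It is not necessarily the algorithm used in this project; the function name `WildcardMatch` and the support for `?` as a single-character wildcard are assumptions made for the example.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true when `line` matches `pattern`.
// '*' matches any sequence of characters (including an empty one),
// '?' matches exactly one character (assumed syntax, see the lead-in).
bool WildcardMatch(std::string_view pattern, std::string_view line) noexcept {
    constexpr std::size_t npos = std::string_view::npos;
    std::size_t p = 0;           // current position in the pattern
    std::size_t s = 0;           // current position in the line
    std::size_t starPos = npos;  // position of the most recent '*' in the pattern
    std::size_t starMatch = 0;   // line position up to which that '*' currently extends

    while (s < line.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            // Remember the star and first try to match it against nothing.
            starPos = p++;
            starMatch = s;
        } else if (p < pattern.size() && (pattern[p] == '?' || pattern[p] == line[s])) {
            ++p;
            ++s;
        } else if (starPos != npos) {
            // Mismatch: let the last '*' absorb one more character and retry.
            p = starPos + 1;
            s = ++starMatch;
        } else {
            return false;
        }
    }
    // Only trailing stars may remain unmatched in the pattern.
    while (p < pattern.size() && pattern[p] == '*') {
        ++p;
    }
    return p == pattern.size();
}
```

With this sketch, `WildcardMatch("*string*", line)` is true for any line that contains `string`. Only four index variables are kept, so memory use does not depend on the pattern or line length, although the worst-case running time is still `O(P*N)`.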
17+
I had no doubt that a more efficient algorithm must exist for this task.
18+
The only way to beat `grep` was to add a few optimizations that simply fast-forward the
19+
algorithm in the most common scenarios. This change made the matching algorithm about 6 times faster
20+
and reduced the total running time of the program by a factor of 3.
21+
The improved algorithm beat `grep` by only 25-35%.
22+
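The notes do not say which optimizations were added. As one plausible illustration, a pattern of the form `*literal*` can bypass the general matcher and use a plain substring search. The function name `MatchContainsFastPath` and the treatment of `?` as a wildcard are assumptions made for this sketch.

```cpp
#include <optional>
#include <string_view>

// Hypothetical fast path: handles only patterns of the form "*literal*".
// Returns std::nullopt when the pattern has a different shape, in which case
// the caller is expected to fall back to its general matching algorithm.
std::optional<bool> MatchContainsFastPath(std::string_view pattern,
                                          std::string_view line) noexcept {
    if (pattern.size() < 2 || pattern.front() != '*' || pattern.back() != '*') {
        return std::nullopt;
    }
    // Text between the leading and the trailing star.
    const std::string_view literal = pattern.substr(1, pattern.size() - 2);
    if (literal.find_first_of("*?") != std::string_view::npos) {
        return std::nullopt;  // more wildcards inside: not a pure literal
    }
    // A substring search is enough; an empty literal matches every line.
    return line.find(literal) != std::string_view::npos;
}
```

The pattern shape can be checked once per run instead of once per line, so the per-line cost reduces to a single `find` call, which standard libraries typically implement with well-optimized search routines.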
23+
After that, according to the profiler, almost half of the time was spent
24+
in the line-matching algorithm, and close to 40% of the time was spent in the synchronous `ReadFile()` call.
25+
26+
I decided this was the moment for asynchronous file reading to shine!
27+
That way, the operating system reads the next block of the file while the previous one is being parsed and processed.
28+
I implemented it and... nothing. The total running time did not change.
29+
But the profiler showed that time had shifted toward the line-matching algorithm.
30+
It was very strange that the line-matching algorithm slowed down, and I still don't understand why.
31+
I am convinced that the 40% spent in `ReadFile()` could be reduced to at most 5%
32+
by running data reading and data processing in parallel.
33+
Perhaps this is somehow related to the fact that the data is in the system file cache
34+
and is not really read from the disk (shorter IRP path).
35+
Perhaps the plain memory copy in kernel mode parallelizes poorly.
36+
Maybe it would have been worth rewriting it so that reading from disk is performed in a dedicated thread...
37+
38+
In the next iteration I tried to map the file to memory.
39+
This solution does not meet the requirements because it may raise SEH exceptions in case of disk read errors.
40+
I also had doubts about how quickly new pages would be loaded with this approach.
41+
The result was slightly slower: the total running time of the program increased by 20%.
42+
That is also strange, given that the disk cache was warm.
43+
Theoretically, if the data is in the disk cache, it should be possible to map it
44+
into the process's read-only virtual memory in `O(1)`,
45+
which would save both the transitions to kernel mode during the memory scan and the memory copying.
46+
47+
## Testing and Notes
48+
49+
I tested on a 2 GB web server log with 5.5 million lines;
50+
the average line length was 380 bytes, and no line was longer than 1024 bytes.
51+
1600 lines out of 5.5M matched the pattern. I chose the pattern `*string*` as the most common in everyday use.
52+
53+
The file was on an SSD, but I warmed up the cache first so that all the data was in the system file cache.
54+
55+
CPU: `Intel Core i5 8th Gen`, laptop edition.
56+
57+
The application built for the `x64` architecture ran faster than the `x86` build.
58+
59+
The `FILE_FLAG_SEQUENTIAL_SCAN` flag did not give a performance boost on a warmed cache.
60+
The cold-cache case would have to be measured separately.
61+
62+
Sometimes the application's execution time stays about 25% higher for a long stretch.
63+
Most likely this is because I am working on a laptop and the CPU cores have power-saving modes.
64+
65+
The latest application version takes 1.6 seconds to process test data while `grep` takes 2.5 seconds.
66+
67+
**ADDED:**
68+
I also implemented reading the file in a separate thread. The file operation is synchronous,
69+
and synchronization between the threads is done with a lock-free loop (spinlock).
70+
It gave a total gain of 25% over the synchronous and asynchronous API solutions (total running time is 1.2 seconds).
71+
This is 2 times faster than `grep`.
72+
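As a rough illustration of such a hand-off, the sketch below passes blocks from a reader thread to a processing thread through two buffers guarded by atomic flags. It is not the project's `CLockFreeLineReader`: the names `Slot` and `PumpThroughSlots`, the 4096-byte block size, and the in-memory `source` standing in for the synchronous file read are all assumptions. Unlike the main application, the sketch uses `std::thread` and `std::string`, which may throw.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <string>
#include <string_view>
#include <thread>

// One hand-off buffer shared by the reader thread and the processing thread.
struct Slot {
    std::array<char, 4096> data{};
    std::size_t size = 0;           // valid bytes in `data`; 0 marks end of input
    std::atomic<bool> full{false};  // true: consumer owns the slot, false: producer owns it
};

// Hypothetical demo: copies `source` through two slots and returns what the
// consumer received, so the result can be compared with the input.
std::string PumpThroughSlots(std::string_view source) {
    std::array<Slot, 2> slots;

    // Producer: stands in for the thread that performs synchronous reads.
    std::thread producer([&slots, source] {
        std::size_t offset = 0;
        for (std::size_t i = 0;; ++i) {
            Slot& slot = slots[i % 2];
            // Spin until the consumer has released this slot.
            while (slot.full.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            const std::size_t n = std::min(slot.data.size(), source.size() - offset);
            std::copy_n(source.data() + offset, n, slot.data.begin());
            slot.size = n;
            offset += n;
            slot.full.store(true, std::memory_order_release);  // publish the block
            if (n == 0) {
                break;  // an empty block tells the consumer that input has ended
            }
        }
    });

    // Consumer: stands in for the line-matching side.
    std::string received;
    for (std::size_t i = 0;; ++i) {
        Slot& slot = slots[i % 2];
        // Spin until the producer has published this slot.
        while (!slot.full.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        if (slot.size == 0) {
            break;
        }
        received.append(slot.data.data(), slot.size);
        slot.full.store(false, std::memory_order_release);  // hand the slot back
    }
    producer.join();
    return received;
}
```

Because the slots are used in strict alternation, each side only ever waits on one flag. The `yield()` inside the wait loops is a courtesy for machines with few cores; a pure busy-wait is the other common choice when both threads have a dedicated core.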
73+
## Implementation Features
74+
75+
I kept all four implementations of reading files. You can switch them in code:
76+
77+
```cpp
78+
#if 0
79+
#if 0
80+
CSyncLineReader _lineReader;
81+
#else
82+
CMappingLineReader _lineReader;
83+
#endif
84+
#else
85+
#if 0
86+
CAsyncLineReader _lineReader;
87+
#else
88+
CLockFreeLineReader _lineReader;
89+
#endif
90+
#endif
91+
```
92+
93+
The solution contains unit tests in a separate project based on the `gtest` framework.
94+
95+
As required by the challenge, the main console application is built with C++ exceptions disabled and no RTTI.
96+
97+
I have used some parts of the STL at my own risk.
98+
These parts do not use exceptions and add no unnecessary overhead.
99+
I see no reason to avoid cheap abstractions that help write cleaner and less error-prone code.
100+
I mean things like `std::unique_ptr`, `std::string_view`, `std::optional`, etc.
101+
102+
---

docs/task-description.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
# C++ Programmer's Test Task
1+
# C++ Programmer's Test Task Description
22

33
It is necessary to write a class in pure C++ that can read huge text log files
44
(hundreds of megabytes, tens of gigabytes) as quickly as possible and produce lines
