vProto - is an advanced description of regular expressions, allowing for the specification of a Finite State Machine, including states, behaviors, and transition conditions. This description serves as the basis for generating C++ code, tailored for data parsing purposes.
State Machine descriptions are done using tokens. Tokens written on a single line denote sequential traversal (once one is completed, it moves to the next). Branching occurs when transitioning to a new line with an increase in depth (increase in indentation). Lines at the same depth correspond to branches. At the end of a line, there may be the presence of the symbol \ which symbolizes continuation of the line (including possible branching), and it will be used during transition in case branching is terminated.
Example:The token's name begins with the line number, followed by the token number within the line. Essentially, at the beginning of any description, the first token represents branching. In the example, this corresponds to either token_1_1 or token_7_1 (because they are at the same depth or indentation level). Let's assume a transition to token_1_1; then token_1_2 and token_1_3 will be executed sequentially, one after the other, since they are on the same line. After token_1_3, branching occurs to token_2_1, token_3_1, and token_6_1 because they are at the same depth. Let's say a transition occurs to token_3_1; then after its successful completion, there will be a transition to token_3_2, and then branching occurs again between token_4_1 and token_5_1 (as they are at the same depth). Assuming a transition to token_4_1, then next will be token_4_2, which marks the end of the branch. After token_4_2, a return back occurs. The return back continues until a \ symbol is found at the end of the branch. The presence of this symbol on line 3 indicates that the transition will be made back to line 1, and the presence of \ on line 1 returns to the beginning. So, upon completion of token_4_2, the transition will be to the beginning, to the branching of token_1_1 and token_7_1. Another behavior occurs upon completion of token_8_2 or token_9_2; for them, branching occurs on line 7, where \ is absent. This means that upon completion of lines 8 or 9, the transition will be made to the branching in line 7, i.e., between token_8_1 and token_9_1, creating a cyclic transition through branching.
The successful passage of a token (and consequently, transition to the next) depends on the incoming data. Sometimes tokens can be optional or have certain requirements for their repetition. It's possible to modify token traversal properties by adding additional options using parentheses, for example: (min=5, max=10, init=100).
If incoming characters do not match the token, an exception will occur (preventing a transition to the next token), which can be caught; otherwise, parsing will exit. To catch the exception, you need to add a branch with 'catch:' specified. If a '\' character exists in the branch, it will be analyzed as part of the parent branch catch.
The demonstration of using examples can be found at the following URLDescribes the range of permissible values for the incoming byte. The expected format is equivalent to standard range in regular expressions. This token supports:
As mentioned earlier, when initializing variables, the main type is a range, which can describe all variables. However, there are also shorter forms of variable notation:
"casesensitive" or 'no-casesensitive' - describes constant words or binary values. For example, in the HTTP protocol, words like "GET" or "HTTP/1.1" are case-sensitive, while headers like 'Content-Length' or 'Content-type' can be written using uppercase or lowercase characters. You can also describe the binary value of a character, like in ranges.
{ c++ code } - calling user's C++ code, this token doesn't use incoming bytes but can be used for modifying variables, invoking callbacks, determining branching conditions, or during debugging stages. This code is called as a function, and inside it, the user can "return false" (by default, it automatically returns true) - indicating successful or unsuccessful token processing. Examples:
b8:type { printf("Read Type: %u
", type); }
{ return type == 1; } { userFunction1(); }
{ return type == 2; } { userFunction2(); }
{ return type == 3; } { userFunction3(); }
C++ insertion is used for debugging\printf function (automatically setting successful token processing return true), as well as for branching: the "return type == ..." returns true or false, thereby selecting a branch transition.
If a user wishes to add their own functions or repositories, they need to define the "OutputClassName". The parser will inherit from this class and access its variables and functions. It is recommended to inherit OutputClassName from the demoResult state structure.
notify:userFunction - essentially replaces the C++ insert { userFunction(); }, but shorter and more efficient in terms of performance.begin:userFunction end:userFunction - transfers to the user all data between the beginning and the end.
Example(will return to the user all the data that was used in the tokens between begin and end):
"GET" [ \t]+ begin:userFunction [^ \t]+ end:userFunction [ \t]+ "HTTP/" [0-9] "." [0-9] "\r"? "\n"
This tokens are used for transitions within the state tree:
<listTokensCase1, listTokensCase2,..> - this token allows creating nested branching variations within itself, describing a sequence of regular tokens and comma-separated alternative options. Its presence changes the logic of the entire state machine by creating parallel states, enabling simultaneous processing across multiple states. Completion of any sequence leads to the termination of other parallel states, essentially, the 'bang' token is automatically set after this token. This token significantly simplifies understanding of the finite state machine, but working in parallel states is always slower than working in a single state. It is recommended to minimize the description of variations in this token so that options are discarded as quickly as possible.
Example:
<"GET", "POST", "PUT"> - All three states are processed in parallel until one of them wins. If the byte G is received, then only the GET branch has a chance of passing; POST and PUT automatically fail. But if the byte P is received, then we will be simultaneously in the POST and PUT states, waiting for other symbols to determine the state.
Example (parallel states):
"GET" [ \t]+ \
<'url-1'> [ \t]+
<'url-2'> [ \t]+
<'url-3'> [ \t]+
All three states url-1, url-2, url-3 are considered in parallel as data arrives (which can come byte by byte). Decisions are made about which states to exclude gradually.
Example (no parallel states):
"GET" [ \t]+ \
'url-1' [ \t]+
'url-2' [ \t]+
'url-3' [ \t]+
Without using <>, branching will not be considered as parallel states, and transitions to url-2 and url-3 will not be possible because upon the arrival of the symbol 'u', the transition to the first, more prioritized branch url-1 (because it first) will be chosen, and if, for example, a url-2 comes next, the transition to url-2 will not occur. An exception will occur at the utl-1 branch.
Special attention is given to the performance of the generated code, even without special instructions (which exist in x86-64), to ensure code universality and its operation across multiple platforms.
Recommendations for achieving maximum performance:
vProto | vProto-SSE4.2 | boost-1.5::http | |
---|---|---|---|
ARM-8 rev1 2200mhz | 11.51gbit/s (3.34m req/s) | doesn't support | 3.43gbit/s (997.09k req/s) |
AMD Ryzen9 5950X | 13.28gbit/s (3.86m req/s) | 15.60gbit/s (4.53m req/s) | 5.19gbit/s (1.50m req/s) |
AMD Ryzen7 8845hs | 29.77gbit/s (8.65m req/s) | 41.90gbit/s (12.18m req/s) | 5.48gbit/s (1.59m req/s) |
Intel Xeon W-2223 | 13.74gbit/s (3.99m req/s) | 25.50gbit/s (7.41m req/s) | 4.13gbit/s (1.20m req/s) |
Intel Xeon Gold 6348 | 16.85gbit/s (4.89m req/s) | 26.30gbit/s (7.64m req/s) | 5.44gbit/s (1.58m req/s) |
Intel i7-1065G7 | 18.53gbit/s (5.38m req/s) | 27.00gbit/s (7.84m req/s) | 4.45gbit/s (1.29m req/s) |
AMD EPYC 9454P | 21.14gbit/s (6.14m req/s) | 31.30gbit/s (9.09m req/s) | 6.10gbit/s (1.77m req/s) |
Intel i9-12900k | 31.22gbit/s (9.07m req/s) | 40.80gbit/s (11.86m req/s) | 9.88gbit/s (2.87m req/s) |
Intel i9-13900hx | 27.50gbit/s (7.99m req/s) | 41.70gbit/s (12.12m req/s) | 9.73gbit/s (2.82m req/s) |
AMD Ryzen 9 9950X3D | 43.70gbit/s (12.70m req/s) | 59.30gbit/s (17.23m req/s) | 13.60gbit/s (3.95m req/s) |
vProto | vProto-SSE4.2 | boost-1.5::http | |
---|---|---|---|
ARM-8 rev1 2200mhz | 13.57gbit/s (3.94m req/s) | doesn't support | 4.42gbit/s (1.28m req/s) |
AMD Ryzen9 5950X | 13.77gbit/s (4.00m req/s) | 16.30gbit/s (4.73m req/s) | 6.51gbit/s (1.89m req/s) |
AMD Ryzen7 8845hs | 33.23gbit/s (9.65m req/s) | 57.60gbit/s (16.74m req/s) | 6.87gbit/s (1.99m req/s) |
Intel Xeon W-2223 | 17.72gbit/s (5.15m req/s) | 29.30gbit/s (8.51m req/s) | 5.11gbit/s (1.48m req/s) |
Intel Xeon Gold 6348 | 18.63gbit/s (5.41m req/s) | 32.80gbit/s (9.53m req/s) | 5.91gbit/s (1.71m req/s) |
Intel i7-1065G7 | 19.91gbit/s (5.78m req/s) | 36.50gbit/s (10.61m req/s) | 5.32gbit/s (1.54m req/s) |
AMD EPYC 9454P | 25.73gbit/s (7.47m req/s) | 44.20gbit/s (12.84m req/s) | 8.73gbit/s (2.53m req/s) |
Intel i9-12900k | 39.67gbit/s (11.53m req/s) | 62.00gbit/s (18.02m req/s) | 11.47gbit/s (3.33m req/s) |
Intel i9-13900hx | 40.53gbit/s (11.78m req/s) | 63.70gbit/s (18.51m req/s) | 11.59gbit/s (3.36m req/s) |
AMD Ryzen 9 9950X3D | 44.20gbit/s (12.84m req/s) | 72.00gbit/s (20.93m req/s) | 14.78gbit/s (4.29m req/s) |
jsonFlow | jsonPerf | rapidJson-v1.1.0 | |
---|---|---|---|
ARM-8 rev1 2200mhz | 4.91gbit/s (771.04k req/s) | 8.73gbit/s (1.37m req/s) | 4.11gbit/s (645.41k req/s) |
AMD Ryzen9 5950X | 6.10gbit/s (957.91k req/s) | 16.20gbit/s (2.54m req/s) | 6.33gbit/s (994.03k req/s) |
AMD Ryzen7 8845hs | 11.00gbit/s (1.72m req/s) | 19.77gbit/s (3.10m req/s) | 9.44gbit/s (1.48m req/s) |
Intel Xeon W-2223 | 7.31gbit/s (1.14m req/s) | 9.13gbit/s (1.43m req/s) | 4.88gbit/s (766.33k req/s) |
Intel Xeon Gold 6348 | 7.18gbit/s (1.12m req/s) | 11.14gbit/s (1.74m req/s) | 5.83gbit/s (915.51k req/s) |
Intel i7-1065G7 | 7.54gbit/s (1.18m req/s) | 11.20gbit/s (1.75m req/s) | 8.45gbit/s (1.32m req/s) |
AMD EPYC 9454P | 8.13gbit/s (2.36m req/s) | 14.72gbit/s (4.27m req/s) | 9.57gbit/s (2.78m req/s) |
Intel i9-12900k | 11.09gbit/s (1.74m req/s) | 18.41gbit/s (2.89m req/s) | 12.29gbit/s (1.92m req/s) |
Intel i9-13900hx | 13.22gbit/s (2.07m req/s) | 22.64gbit/s (3.55m req/s) | 13.27gbit/s (2.08m req/s) |
AMD Ryzen 9 9950X3D | 17.00gbit/s (2.66m req/s) | 26.81gbit/s (4.21m req/s) | 15.60gbit/s (2.44m req/s) |
jsonFlow | jsonPerf | rapidJson-v1.1.0 | |
---|---|---|---|
ARM-8 rev1 2200mhz | 5.14gbit/s (807.16k req/s) | 9.25gbit/s (1.45m req/s) | 2.93gbit/s (460.11k req/s) |
AMD Ryzen9 5950X | 6.07gbit/s (953.20k req/s) | 18.60gbit/s (2.92m req/s) | 5.26gbit/s (826.00k req/s) |
AMD Ryzen7 8845hs | 11.20gbit/s (1.75m req/s) | 22.70gbit/s (3.56m req/s) | 6.13gbit/s (962.62k req/s) |
Intel Xeon W-2223 | 7.47gbit/s (1.17m req/s) | 11.52gbit/s (1.80m req/s) | 4.16gbit/s (653.26k req/s) |
Intel Xeon Gold 6348 | 7.92gbit/s (1.24m req/s) | 12.48gbit/s (1.95m req/s) | 4.13gbit/s (648.55k req/s) |
Intel i7-1065G7 | 8.70gbit/s (1.36m req/s) | 12.67gbit/s (1.98m req/s) | 4.77gbit/s (749.05k req/s) |
AMD EPYC 9454P | 9.84gbit/s (2.86m req/s) | 17.10gbit/s (4.97m req/s) | 5.45gbit/s (1.58m req/s) |
Intel i9-12900k | 14.20gbit/s (2.22m req/s) | 24.55gbit/s (3.85m req/s) | 8.50gbit/s (1.33m req/s) |
Intel i9-13900hx | 16.10gbit/s (2.52m req/s) | 27.13gbit/s (4.26m req/s) | 7.02gbit/s (1.10m req/s) |
AMD Ryzen 9 9950X3D | 19.10gbit/s (2.99m req/s) | 29.32gbit/s (4.60m req/s) | 10.61gbit/s (1.66m req/s) |
My collection of parsed protocols (via vProto) includes: