vSyn.ru

Introduction
Range [a-zA-Z\x20\r\n]
Variables u8-64, h8-64, b8-64, array, string, string_view, bool, enum
"casesensitive text" or 'no-casesensitive'
{ c++ user code } notify:userFunction
begin:userFunction end:userFunction
label:name, jmp:name, call:name, return
reset, break, bang, back:depthNumber
<tokensCase1, tokensCase2, tokensCase3>
vproto:moduleName:varName
Optimization & Performance
Cooperation

Introduction

vProto is an extended regular-expression notation for describing a Finite State Machine, including its states, behaviors, and transition conditions. The description is used to generate C++ code for data parsing.

Convenient for analyzing and parsing existing protocols (both textual and binary) and for describing custom data structures
Supports stream processing with state preservation, so data can arrive in any fragmentation (even byte by byte), which is crucial for TCP
C++ code can be embedded in the description for simplicity and flexibility
Wide range of applicability
High performance of the generated code and platform portability (code generation for other programming languages is planned)
The syntax blends standard regular expressions with a human-friendly view of finite state machines, transitions, and branches
A C++98 mode lets the generated code run on embedded platforms, including microcontrollers. For example, I launched an HTTP server on an STM32 with code generated by vProto (tested with IAR IDE 6.5)

A State Machine is described using tokens. Tokens written on a single line are traversed sequentially (once one completes, the next one starts). Branching occurs when the next line has a greater depth (increased indentation). Lines at the same depth are alternative branches. A line may end with the symbol \, which marks continuation of the line (including possible branching); it determines where the transition goes when a branch terminates.

Example:
    token_1_1 token_1_2 token_1_3 \
        token_2_1 token_2_2
        token_3_1 token_3_2 \
            token_4_1 token_4_2
            token_5_1 token_5_2
        token_6_1 token_6_2
    token_7_1 token_7_2 token_7_3
        token_8_1 token_8_2
        token_9_1 token_9_2

Each token name consists of the line number followed by the token's position within the line. The first token of any description is always a branching point. In the example, this is the choice between token_1_1 and token_7_1, because they are at the same depth (indentation level).

Assume the transition goes to token_1_1; then token_1_2 and token_1_3 are executed sequentially, since they are on the same line. After token_1_3, branching occurs between token_2_1, token_3_1, and token_6_1, because they are at the same depth. Say the transition goes to token_3_1; after its successful completion comes token_3_2, and then branching occurs again between token_4_1 and token_5_1. Assuming a transition to token_4_1, the next token is token_4_2, which ends the branch.

After token_4_2, a return back occurs. How far the return goes depends on the \ symbols at the ends of the parent lines. The \ on line 3 means the return continues to line 1, and the \ on line 1 means it continues to the beginning. So, upon completion of token_4_2, the transition goes to the beginning, to the branching between token_1_1 and token_7_1.

Completion of token_8_2 or token_9_2 behaves differently: their parent is line 7, where \ is absent. Upon completion of line 8 or 9, the transition therefore goes to the branching under line 7, i.e., between token_8_1 and token_9_1, creating a cyclic transition through that branching.
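The return rule can be sketched in C++ as follows. This is only one reading of the walkthrough above, not vProto's actual implementation: the Line struct and the functions parentOf, returnTarget and exampleLines are hypothetical names introduced for illustration, and the behavior for lines the text does not discuss (such as line 2) is an assumption.

```cpp
#include <vector>

// One line of a description: indentation depth and whether it ends with '\'.
// Hypothetical model for illustration; not part of vProto.
struct Line {
    int depth;
    bool continues;
};

// Index of the closest preceding line with a smaller depth, or -1 at top level.
int parentOf(const std::vector<Line>& lines, int index) {
    for (int i = index - 1; i >= 0; --i) {
        if (lines[i].depth < lines[index].depth) return i;
    }
    return -1;
}

// After the last token of line `index` completes, walk up through the parent
// lines for as long as they are continued. The result is the line whose child
// branches are tried next; -1 means the top-level branching.
int returnTarget(const std::vector<Line>& lines, int index) {
    int parent = parentOf(lines, index);
    while (parent >= 0 && lines[parent].continues) {
        parent = parentOf(lines, parent);
    }
    return parent;
}

// The nine-line example, zero-based (line 1 of the example is index 0).
std::vector<Line> exampleLines() {
    return {
        {0, true},   // line 1 (continued)
        {1, false},  // line 2
        {1, true},   // line 3 (continued)
        {2, false},  // line 4
        {2, false},  // line 5
        {1, false},  // line 6
        {0, false},  // line 7
        {1, false},  // line 8
        {1, false},  // line 9
    };
}
```

With this model, returnTarget(exampleLines(), 3) yields -1 (line 4 returns to the beginning), while returnTarget(exampleLines(), 7) yields 6 (line 8 returns to the branching under line 7).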

The successful passage of a token (and consequently, transition to the next) depends on the incoming data. Sometimes tokens can be optional or have certain requirements for their repetition. It's possible to modify token traversal properties by adding additional options using parentheses, for example: (min=5, max=10, init=100).

min - the minimum number of bytes that should be used in this token (if less, then an exception)
max - the maximum number of bytes that should be used in this token (if the quantity is reached, then proceed to next)
init - the maximum size of stored data for a std::string
sse - insert SSE4.2 code (range tokens only). Improves performance only if the token consumes many bytes
nosse - do not insert SSE4.2 code (range tokens only)

Shorter forms can also be used: ? (equivalent to min=0, max=1), * (equivalent to min=0, max=infinity), + (equivalent to min=1, max=infinity).
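As a small illustration, the shorthand forms can be mapped to min/max limits like this. The Repeat struct, the kInfinity constant and the fromShorthand function are hypothetical names, and representing "infinity" as the largest 64-bit value is an assumption made for the sketch.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Repetition limits of a token. Hypothetical struct for illustration.
struct Repeat {
    std::uint64_t min;
    std::uint64_t max;
};

// "infinity" is modeled as the largest representable value (an assumption).
constexpr std::uint64_t kInfinity = std::numeric_limits<std::uint64_t>::max();

// Maps a shorthand character to the limits listed above.
// Characters that are not a shorthand yield std::nullopt.
constexpr std::optional<Repeat> fromShorthand(char c) {
    switch (c) {
        case '?': return Repeat{0, 1};
        case '*': return Repeat{0, kInfinity};
        case '+': return Repeat{1, kInfinity};
        default:  return std::nullopt;
    }
}
```

For example, fromShorthand('+') describes the same limits as writing (min=1, max=infinity) explicitly.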

If incoming characters do not match the token, an exception will occur (preventing a transition to the next token), which can be caught; otherwise, parsing will exit. To catch the exception, you need to add a branch with 'catch:' specified. If a '\' character exists in the branch, it will be analyzed as part of the parent branch catch.

The demonstration of using examples can be found at the following URL

Range [a-zA-Z\x20\r\n]

Describes the range of permissible values for the incoming byte. The expected format is equivalent to standard range in regular expressions. This token supports:

special characters: [\r\n\t] (equivalent to: \x0D\x0A\x09)
the caret symbol (^) denotes negation. For example, [^\r\n] matches any character except \r\n
[a-zA-Z] matches any character within the specified range (including boundaries). If you want to enter the '-' symbol, then you need to use: \-
[\xHH] - specifying hexadecimal values. For example \x20 means 0x20 as the hexadecimal byte value.
The modifiers + ? * are convenient here. For example, [ \t]+ keeps matching as long as spaces or tabs are received; if no such character arrives at least once, an exception occurs and no transition to the next token is made
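To make the semantics of a range concrete, here is a minimal hand-written sketch of a byte-set lookup. It is not the code vProto generates; ByteSet, makeRange and matches are hypothetical names used only for this illustration.

```cpp
#include <bitset>
#include <initializer_list>
#include <utility>

// Set of accepted byte values, one bit per possible byte.
using ByteSet = std::bitset<256>;

// Builds a set from inclusive (first, last) pairs, e.g. {'a', 'z'}.
// With negate == true the set is inverted, as with a leading '^'.
ByteSet makeRange(
    std::initializer_list<std::pair<unsigned char, unsigned char>> spans,
    bool negate = false) {
    ByteSet set;
    for (const auto& span : spans) {
        // 'unsigned' avoids wrap-around when a span ends at 0xFF.
        for (unsigned value = span.first; value <= span.second; ++value) {
            set.set(value);
        }
    }
    return negate ? ~set : set;
}

// True if the incoming byte is permitted by the range.
bool matches(const ByteSet& set, unsigned char byte) {
    return set.test(byte);
}
```

Under these assumptions, [a-zA-Z\x20\r\n] corresponds to makeRange({{'a','z'}, {'A','Z'}, {0x20,0x20}, {'\r','\r'}, {'\n','\n'}}), and [^\r\n] to makeRange({{'\r','\r'}, {'\n','\n'}}, true).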

This token also supports redirection to a variable: while the parser is within this token, all data is redirected to the specified variable (equivalent to min=1, max=infinity). The modifiers string/string_view/array/uint/hex/bool/bin(binLe)/binBe can be used to control how the variable is initialized and assigned.

Example of redirecting to numbers (all numbers 0-9 will be written to the variable): [0-9]->uint:variable
Example of redirecting to string (all characters except \r\n will be written to the variable): [^\r\n]->string:variable
Example of redirecting to user function: [a-zA-Z0-9]->userFunction
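The following sketch shows, by hand, what redirection of [0-9] to an unsigned number could look like when data arrives in fragments. It is a simplified stand-in, not vProto output: the UintToken struct and its members are hypothetical, and overflow handling is deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hand-written stand-in for "[0-9]->uint:variable"; not generated by vProto.
struct UintToken {
    std::uint64_t value = 0;   // the redirected variable
    std::uint64_t digits = 0;  // bytes consumed so far, kept between calls

    // Consumes the leading digits of `chunk` and returns how many bytes were
    // used. A result smaller than chunk.size() means a non-digit ended the token.
    std::size_t feed(std::string_view chunk) {
        std::size_t used = 0;
        while (used < chunk.size() && chunk[used] >= '0' && chunk[used] <= '9') {
            value = value * 10 + static_cast<std::uint64_t>(chunk[used] - '0');
            ++digits;
            ++used;
        }
        return used;
    }

    // min=1: at least one digit must have been seen.
    bool satisfied() const { return digits >= 1; }
};
```

Because the state lives in the struct, feeding "12", then "3", then "4\r\n" produces the same value (1234) as feeding "1234\r\n" in one call.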

Variables u8-64, h8-64, b8-64, array, string, string_view, bool, enum

As mentioned earlier, when initializing variables, the main type is a range, which can describe all variables. However, there are also shorter forms of variable notation:

u8:variable, u16:variable, u32:variable, u64:variable - equivalent: [0-9]->uint:variable and modifiers: (min=1) for u8 (1 byte), for u16 (2 bytes), etc
h8:variable, h16:variable, h32:variable, h64:variable - equivalent: [0-9a-zA-Z]->hex:variable and modifiers: (min=1) for h8 (1 byte), for h16 (2 bytes), etc
b8:variable, b16:variable, b32:variable, b64:variable - (little-endian) equivalent: [\x00-\xff]->bin:variable and modifiers: (min=1, max=1) for b8 (1 byte), (min=2, max=2) for b16 (2 bytes), etc
bBE8:variable, bBE16:variable, bBE32:variable, bBE64:variable - (big-endian) equivalent: [\x00-\xff]->bBE:variable and modifiers: (min=1, max=1) for bBE8 (1 byte), (min=2, max=2) for bBE16 (2 bytes), etc
bool:variable (the variable will be initialized as bool) - equivalent: [\x00-\xff]->bin:variable(min=1, max=1)
data:variable - equivalent: [\x00-\xff]->variable. It's often convenient to add a modifier: (max=count)
array:variable - equivalent: [\x00-\xff]->array:variable. Save data to array, it's often convenient to add a modifier: (init=count or max=count)
string:variable - equivalent: [\x00-\xff]->string:variable. Save data to std::string (allocation inside string and data copying), it's often convenient to add a modifier: (init=count)
string_view:variable - equivalent: [\x00-\xff]->string_view:variable. Save data to std::string_view (actually, it does not store data, it only performs markup)
enum{e1, e2, e3}:variable - initializing enum.

If you do not specify the ':', redirection to the variable will not occur, but the data will be ignored, equivalent to range without redirection.
Example of TLV parsing: b8:type b32:length data:value(max=length)
Example of UDP header parsing: bBE16:srcPort bBE16:dstPort bBE16:dataLength bBE16:checksum data:udpPayload(max=dataLength)
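For comparison, a hand-written C++ decoder for the TLV description might look like the sketch below. It assumes the whole record is already in one buffer, which the generated parser does not require; Tlv, readLe, readBe and parseTlv are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Result of "b8:type b32:length data:value(max=length)". Hypothetical struct.
struct Tlv {
    std::uint8_t type;
    std::uint32_t length;
    std::string_view value;  // points into the input buffer
};

// Reads `count` (1..4) bytes as a little-endian unsigned integer (b8..b32).
std::uint32_t readLe(std::string_view bytes, std::size_t offset, std::size_t count) {
    std::uint32_t result = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const auto byte = static_cast<unsigned char>(bytes[offset + i]);
        result |= static_cast<std::uint32_t>(byte) << (8 * i);
    }
    return result;
}

// Same, but big-endian (bBE8..bBE32).
std::uint32_t readBe(std::string_view bytes, std::size_t offset, std::size_t count) {
    std::uint32_t result = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const auto byte = static_cast<unsigned char>(bytes[offset + i]);
        result = (result << 8) | byte;
    }
    return result;
}

// Parses one complete TLV record; returns nullopt if the buffer is too short.
std::optional<Tlv> parseTlv(std::string_view buffer) {
    constexpr std::size_t kHeaderSize = 5;  // 1 byte type + 4 bytes length
    if (buffer.size() < kHeaderSize) return std::nullopt;
    Tlv tlv{};
    tlv.type = static_cast<std::uint8_t>(readLe(buffer, 0, 1));
    tlv.length = readLe(buffer, 1, 4);
    if (buffer.size() - kHeaderSize < tlv.length) return std::nullopt;
    tlv.value = buffer.substr(kHeaderSize, tlv.length);
    return tlv;
}
```

The UDP header fields would use readBe with a count of 2 instead, since they are declared with bBE16.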

"casesensitive text" or 'no-casesensitive'

"casesensitive" or 'no-casesensitive' - describes constant words or binary values. For example, in the HTTP protocol, words like "GET" or "HTTP/1.1" are case-sensitive, while headers like 'Content-Length' or 'Content-type' can be written using uppercase or lowercase characters. You can also describe the binary value of a character, like in ranges.

{ c++ user code } notify:userFunction

{ c++ code } or notify:userFunction - calls the user's C++ code. This token does not consume incoming bytes but can be used for modifying variables, invoking callbacks, determining branching conditions, or debugging. The code is called as a function, and inside it the user can "return false" (by default, it automatically returns true) to indicate unsuccessful rather than successful token processing. Examples:
b8:type { printf("Read Type: %u ", type); }
    { return type == 1; } { userFunction1(); }
    { return type == 2; } { userFunction2(); }
    { return type == 3; } { userFunction3(); }
The C++ insertions here serve both for debugging with printf (which automatically returns true, i.e. successful token processing) and for branching: "return type == ..." returns true or false, thereby selecting a branch.

If a user wishes to add their own functions or repositories, they need to define the "OutputClassName". The parser will inherit from this class and access its variables and functions. It is recommended to inherit OutputClassName from the demoResult state structure.

notify:userFunction - essentially replaces the C++ insert { userFunction(); }, but shorter and more efficient in terms of performance.
Example TLV data struct: b8:type b32:length [\x00-\xff]->string:value(max=length) notify:gotValue

begin:userFunction end:userFunction

begin:userFunction end:userFunction - transfers to the user all data between the beginning and the end.
Example (returns to the user all the data consumed by the tokens between begin and end):
"GET" [ \t]+ begin:userFunction [^ \t]+ end:userFunction [ \t]+ "HTTP/" [0-9] "." [0-9] "\r"? "\n"

label:name, jmp:name, call:name, return

These tokens are used for transitions within the state tree:

label:labelName - the position 'labelName' to which the transition can be made.
jmp:labelName - transition to label:labelName, just jump to label
call:labelName - transition to label:labelName with position preservation upon return (essentially equivalent to calling a function). The return position can be used to exit the transition and continue analysis at the current position.
return - returning from a call uses the return position saved in the call invocation.

Example:
call:readRequestType call:readUrl call:readHeaders

label:readRequestType
"GET" return
"POST" return

label:readUrl [ \t]+ [^ \t]->string:url [ \t]+ "HTTP/" [0-9] "." [0-9] "\r"? "\n" return

label:readHeaders ...
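The call/return pair behaves like a function call. A minimal model of that idea is sketched below; whether vProto keeps a single saved position or a stack of them is not stated, so the stack is an assumption, and ReturnStack with its members is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of call/return; positions are indexes into a token list.
class ReturnStack {
public:
    // call:label - remember where to continue, then jump to the label.
    std::size_t call(std::size_t nextPosition, std::size_t labelPosition) {
        saved_.push_back(nextPosition);
        return labelPosition;
    }

    // return - continue after the most recent call.
    // Returns false if there is no saved position to return to.
    bool ret(std::size_t& position) {
        if (saved_.empty()) return false;
        position = saved_.back();
        saved_.pop_back();
        return true;
    }

private:
    std::vector<std::size_t> saved_;
};
```

In the example, call:readRequestType would save the position of call:readUrl, and the return after "GET" or "POST" would resume there.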

reset, break, bang, back:depthNumber

reset - full reset of the state, all variables are cleared, transition is made to the initial stage. The parsing will continue
break - destruction of the current state. For instance, if parsing needs to be terminated due to an unhandled situation, this token allows the destruction of the current state. In the case of parallel states, only the current one will be destroyed; the others will continue functioning. If there are no other states, then a return from the State Machine parsing will occur.
bang - leaves only the current state intact; others are destroyed. This is relevant for parallel states. Essentially, this token indicates that after branching, we have chosen the active state
back:depthNumber - makes a transition to branching corresponding to the depth in depthNumber.

Example:
'A' // depth == 0
    'B' // depth == 1
        'C' back:1 // depth == 2, back:1 go to depth == 1, between 'B' and 'D'.
    'D' // depth == 1
        'E' back:0 // depth == 2, back:0 go to depth == 0, between 'A' and 'F'.
'F' // depth == 0

<tokensCase1, tokensCase2, tokensCase3>

<listTokensCase1, listTokensCase2,..> - this token allows creating nested branching variations within itself, describing a sequence of regular tokens and comma-separated alternative options. Its presence changes the logic of the entire state machine by creating parallel states, enabling simultaneous processing across multiple states. Completion of any sequence leads to the termination of other parallel states, essentially, the 'bang' token is automatically set after this token. This token significantly simplifies understanding of the finite state machine, but working in parallel states is always slower than working in a single state. It is recommended to minimize the description of variations in this token so that options are discarded as quickly as possible.
Example:
<"GET", "POST", "PUT"> - All three states are processed in parallel until one of them wins. If the byte G is received, then only the GET branch has a chance of passing; POST and PUT automatically fail. But if the byte P is received, then we will be simultaneously in the POST and PUT states, waiting for other symbols to determine the state.
Example (parallel states):
"GET" [ \t]+ \
    <'url-1'> [ \t]+
    <'url-2'> [ \t]+
    <'url-3'> [ \t]+
All three states url-1, url-2, url-3 are considered in parallel as data arrives (which can come byte by byte). Decisions are made about which states to exclude gradually.
Example (no parallel states):
"GET" [ \t]+ \
    'url-1' [ \t]+
    'url-2' [ \t]+
    'url-3' [ \t]+
Without <>, the branches are not treated as parallel states, and transitions to url-2 and url-3 are not possible: when the symbol 'u' arrives, the first branch, url-1, is chosen because it comes first and therefore has priority. If url-2 then arrives, no transition to url-2 occurs; instead, an exception occurs in the url-1 branch.
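The way parallel states are narrowed down byte by byte can be modeled with a short hand-written class. This is a sketch of the idea only, limited to constant words, and not the code vProto generates; ParallelLiterals and its members are hypothetical names.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// Hand-written model of <"GET", "POST", "PUT">; not vProto's generated code.
class ParallelLiterals {
public:
    explicit ParallelLiterals(std::vector<std::string_view> options)
        : options_(std::move(options)), alive_(options_.size(), true) {}

    // Feeds one byte. Returns the index of the winning option once one of
    // them has been matched completely, otherwise std::nullopt.
    std::optional<std::size_t> feed(char byte) {
        std::optional<std::size_t> winner;
        for (std::size_t i = 0; i < options_.size(); ++i) {
            if (!alive_[i]) continue;
            const std::string_view option = options_[i];
            if (position_ >= option.size() || option[position_] != byte) {
                alive_[i] = false;  // this parallel state fails
            } else if (position_ + 1 == option.size() && !winner) {
                winner = i;         // first option to complete wins
            }
        }
        ++position_;
        return winner;
    }

    // Number of parallel states that are still possible.
    std::size_t aliveCount() const {
        std::size_t count = 0;
        for (bool alive : alive_) {
            if (alive) ++count;
        }
        return count;
    }

private:
    std::vector<std::string_view> options_;
    std::vector<bool> alive_;
    std::size_t position_ = 0;
};
```

Feeding 'P' leaves two states alive (POST and PUT), feeding 'U' leaves only PUT, and feeding 'T' completes it. The per-byte loop over all live options also shows why parallel states cost more than a single state.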

vproto:moduleName:varName

vproto:moduleName:varName - the ability to delegate parsing to other vproto modules. Example: an initial parsing module handles protocols such as IMAP, POP3, or SMTP, and then passes the data to the common MIME module for further parsing. Another example is PCAP parsing, which consists of a separate module for reading the PCAP header (first module):
"\xd4\xc3\xb2\xa1" data(max=20)
    b32:ts_sec b32:ts_usec b32:pktLen b32 vproto:packet:var(max=pktLen)
followed by a module for parsing individual packets (second module name "packet"). For example:
array:macDst(max=6, init=6) array:macSrc(max=6, init=6) bBE16:protoL2 \
    { return protoL2 == 0x0800; /* ipv4 */ } ...
    { return protoL2 == 0x86dd; /* ipv6 */ } ...
Further parsing of the extracted data can also be delegated to subsequent modules. It is recommended to use it in combination with parameters like (max=length) or with ranges such as [a-z]->vproto:moduleName:varName

Optimization & Performance

Special attention is given to the performance of the generated code, even without special instructions (which exist in x86-64), to ensure code universality and its operation across multiple platforms.

Recommendations for achieving maximum performance:

Avoid creating parallel states (<...> token); resolve parallel states manually
Minimize the number of states and, especially, the transitions between them
Use g++ version 11 and higher with the -O3 flag
Use SSE4.2 (flag: '-msse4.2' or '-march=native') for range tokens if they consume an average of 8 or more bytes (CPU dependent optimization). If fewer bytes are used, it might negatively affect performance
Use profiling (code optimization for test data):

g++ -O3 -g -fprofile-generate=gcda ...
g++ -O3 -g -fprofile-use=gcda ...

Performance Testing involves running the most typical GET request (430 bytes) looped 100 million times (43 GB total), with all processing done on a SINGLE CORE IN ONE THREAD. Comparison is made against a modified description of the HTTP protocol (manual parsing of parallel states) and the Boost::HTTP library. The Valgrind report is also attached for a loop of 1 million iterations

CPU	vProto	vProto-SSE4.2	boost-1.5::http
ARM-8 rev1 2200mhz	11.51gbit/s (3.34m req/s)	doesn't support	3.43gbit/s (997.09k req/s)
AMD Ryzen9 5950X	13.28gbit/s (3.86m req/s)	15.60gbit/s (4.53m req/s)	5.19gbit/s (1.50m req/s)
AMD Ryzen7 8845hs	29.77gbit/s (8.65m req/s)	41.90gbit/s (12.18m req/s)	5.48gbit/s (1.59m req/s)
Intel Xeon W-2223	13.74gbit/s (3.99m req/s)	25.50gbit/s (7.41m req/s)	4.13gbit/s (1.20m req/s)
Intel Xeon Gold 6348	16.85gbit/s (4.89m req/s)	26.30gbit/s (7.64m req/s)	5.44gbit/s (1.58m req/s)
Intel i7-1065G7	18.53gbit/s (5.38m req/s)	27.00gbit/s (7.84m req/s)	4.45gbit/s (1.29m req/s)
AMD EPYC 9454P	21.14gbit/s (6.14m req/s)	31.30gbit/s (9.09m req/s)	6.10gbit/s (1.77m req/s)
Intel i9-12900k	31.22gbit/s (9.07m req/s)	40.80gbit/s (11.86m req/s)	9.88gbit/s (2.87m req/s)
Intel i9-13900hx	27.50gbit/s (7.99m req/s)	41.70gbit/s (12.12m req/s)	9.73gbit/s (2.82m req/s)
AMD Ryzen 9 9950X3D	43.70gbit/s (12.70m req/s)	59.30gbit/s (17.23m req/s)	13.60gbit/s (3.95m req/s)

The same test, but after using profiling:

CPU	vProto	vProto-SSE4.2	boost-1.5::http
ARM-8 rev1 2200mhz	13.57gbit/s (3.94m req/s)	doesn't support	4.42gbit/s (1.28m req/s)
AMD Ryzen9 5950X	13.77gbit/s (4.00m req/s)	16.30gbit/s (4.73m req/s)	6.51gbit/s (1.89m req/s)
AMD Ryzen7 8845hs	33.23gbit/s (9.65m req/s)	57.60gbit/s (16.74m req/s)	6.87gbit/s (1.99m req/s)
Intel Xeon W-2223	17.72gbit/s (5.15m req/s)	29.30gbit/s (8.51m req/s)	5.11gbit/s (1.48m req/s)
Intel Xeon Gold 6348	18.63gbit/s (5.41m req/s)	32.80gbit/s (9.53m req/s)	5.91gbit/s (1.71m req/s)
Intel i7-1065G7	19.91gbit/s (5.78m req/s)	36.50gbit/s (10.61m req/s)	5.32gbit/s (1.54m req/s)
AMD EPYC 9454P	25.73gbit/s (7.47m req/s)	44.20gbit/s (12.84m req/s)	8.73gbit/s (2.53m req/s)
Intel i9-12900k	39.67gbit/s (11.53m req/s)	62.00gbit/s (18.02m req/s)	11.47gbit/s (3.33m req/s)
Intel i9-13900hx	40.53gbit/s (11.78m req/s)	63.70gbit/s (18.51m req/s)	11.59gbit/s (3.36m req/s)
AMD Ryzen 9 9950X3D	44.20gbit/s (12.84m req/s)	72.00gbit/s (20.93m req/s)	14.78gbit/s (4.29m req/s)

On average, code generated by vProto runs 3 times faster than Boost::HTTP and 5-6 times faster when using SSE4.2
To achieve the same parsing of HTTP requests, vProto generates three times fewer instructions compared to Boost::HTTP
The impact of profiling on Intel architecture is more significant
The impact of SSE4.2 on AMD is almost negligible (for this specific CPU), while it is significant for Intel
The number of branches constitutes approximately 25% of all executed instructions
The overall branch misprediction rate is moderate at 5.7%. As expected, indirect branches (dynamic predictions based on computed addresses) are mispredicted in as many as 85% of cases, but this does not significantly affect overall performance
The data cache hit rate is extremely high; we are not testing data access, but rather the efficiency of the generated code and its states. Once in a state and without changing it, the performance practically does not depend on whether the input data changes or not
Transitions between states are important during data parsing. For example, if the URL in a GET request is 150 bytes (35% of the total request size), then performance improves by 10% because there are fewer state transitions
On average, without SSE4.2, 1 byte is processed in 6.2 instructions (2644 million instructions / 430 million bytes). However, it's important to understand that SSE4.2 instructions are significantly slower than simple instructions, and that using SSE4.2 has a more substantial impact on the CPU pipeline.
On average, with SSE4.2, 1 byte is processed in 3.4 instructions (1468 million instructions / 430 million bytes)
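The figures in this section are related by simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes that 1 gbit means 10^9 bits, which is what the table values suggest.

```cpp
// Throughput in Gbit/s for a given request size and request rate.
// Assumes 1 Gbit = 1e9 bits.
constexpr double gbitPerSecond(double bytesPerRequest, double requestsPerSecond) {
    return bytesPerRequest * requestsPerSecond * 8.0 / 1e9;
}

// Average number of instructions executed per input byte.
constexpr double instructionsPerByte(double instructions, double bytes) {
    return instructions / bytes;
}
```

For instance, gbitPerSecond(430, 3.34e6) gives about 11.49, close to the 11.51 reported for the first table row (the request rate there is rounded), and instructionsPerByte(1468e6, 430e6) gives about 3.4.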

Another test compares the generated JSON code (from example) with RapidJSON. The test uses a typical JSON of small size, 796 bytes. A distinguishing feature of JSON compared to HTTP is the constant transitions between states, whereas in the case of HTTP, we spend the majority of time in a single state. In this example, used with relatively short data, the generated JSON code runs about 10% faster with SSE4.2 (RapidJSON also utilizes SSE4.2). If the fields in JSON are longer, the contribution to performance from SSE4.2 will be even greater. The Valgrind report is also attached for a loop of 1 million iterations

CPU	jsonFlow	jsonPerf	rapidJson-v1.1.0
ARM-8 rev1 2200mhz	4.91gbit/s (771.04k req/s)	8.73gbit/s (1.37m req/s)	4.11gbit/s (645.41k req/s)
AMD Ryzen9 5950X	6.10gbit/s (957.91k req/s)	16.20gbit/s (2.54m req/s)	6.33gbit/s (994.03k req/s)
AMD Ryzen7 8845hs	11.00gbit/s (1.72m req/s)	19.77gbit/s (3.10m req/s)	9.44gbit/s (1.48m req/s)
Intel Xeon W-2223	7.31gbit/s (1.14m req/s)	9.13gbit/s (1.43m req/s)	4.88gbit/s (766.33k req/s)
Intel Xeon Gold 6348	7.18gbit/s (1.12m req/s)	11.14gbit/s (1.74m req/s)	5.83gbit/s (915.51k req/s)
Intel i7-1065G7	7.54gbit/s (1.18m req/s)	11.20gbit/s (1.75m req/s)	8.45gbit/s (1.32m req/s)
AMD EPYC 9454P	8.13gbit/s (2.36m req/s)	14.72gbit/s (4.27m req/s)	9.57gbit/s (2.78m req/s)
Intel i9-12900k	11.09gbit/s (1.74m req/s)	18.41gbit/s (2.89m req/s)	12.29gbit/s (1.92m req/s)
Intel i9-13900hx	13.22gbit/s (2.07m req/s)	22.64gbit/s (3.55m req/s)	13.27gbit/s (2.08m req/s)
AMD Ryzen 9 9950X3D	17.00gbit/s (2.66m req/s)	26.81gbit/s (4.21m req/s)	15.60gbit/s (2.44m req/s)

The same test, but after using profiling:

CPU	jsonFlow	jsonPerf	rapidJson-v1.1.0
ARM-8 rev1 2200mhz	5.14gbit/s (807.16k req/s)	9.25gbit/s (1.45m req/s)	2.93gbit/s (460.11k req/s)
AMD Ryzen9 5950X	6.07gbit/s (953.20k req/s)	18.60gbit/s (2.92m req/s)	5.26gbit/s (826.00k req/s)
AMD Ryzen7 8845hs	11.20gbit/s (1.75m req/s)	22.70gbit/s (3.56m req/s)	6.13gbit/s (962.62k req/s)
Intel Xeon W-2223	7.47gbit/s (1.17m req/s)	11.52gbit/s (1.80m req/s)	4.16gbit/s (653.26k req/s)
Intel Xeon Gold 6348	7.92gbit/s (1.24m req/s)	12.48gbit/s (1.95m req/s)	4.13gbit/s (648.55k req/s)
Intel i7-1065G7	8.70gbit/s (1.36m req/s)	12.67gbit/s (1.98m req/s)	4.77gbit/s (749.05k req/s)
AMD EPYC 9454P	9.84gbit/s (2.86m req/s)	17.10gbit/s (4.97m req/s)	5.45gbit/s (1.58m req/s)
Intel i9-12900k	14.20gbit/s (2.22m req/s)	24.55gbit/s (3.85m req/s)	8.50gbit/s (1.33m req/s)
Intel i9-13900hx	16.10gbit/s (2.52m req/s)	27.13gbit/s (4.26m req/s)	7.02gbit/s (1.10m req/s)
AMD Ryzen 9 9950X3D	19.10gbit/s (2.99m req/s)	29.32gbit/s (4.60m req/s)	10.61gbit/s (1.66m req/s)

On average, code generated by vProto (jsonPerf) runs almost 2 times faster than RapidJSON
The difference between jsonFlow and jsonPerf lies in the use of std::string or std::string_view respectively. In the first case, it works the same regardless of fragmentation (data can arrive byte by byte). In the second case, it only works once the entire JSON file has been received
This demonstrates the versatility of vProto: quickly replacing a type in the description yields entirely different functionality
RapidJSON operates only in full JSON file mode (no flow mode)
The jsonFlow operates slower because memory allocation and data copying into std::string occur
The use of profiling has a negative impact on RapidJSON but a positive impact on the code generated by vProto (similar to HTTP)
The number of branches and their predictions are quite similar to HTTP parser
The number of instructions executed by the jsonPerf code is 2.5 times lower (5938 million vs 15029 million for 1 million loop iterations)
On average, with SSE4.2, vProto-generated code for jsonPerf consumes 7.4 instructions per byte of data (5938 million / 796 million bytes), whereas RapidJSON consumes 18.8 instructions (15029 million / 796 million bytes)
The efficiency of the pipeline within the CPU has a significant impact on performance, and this is especially noticeable when using a profiler for an AMD processor (a threefold increase in speed by reordering instructions)
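The trade-off between the two variants comes down to copying versus marking. The sketch below illustrates it with two hypothetical structs, FlowField and PerfField; it is not the generated code, only a model of how std::string and std::string_view differ in this role.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// jsonFlow style: bytes are copied, so the field survives fragmentation.
struct FlowField {
    std::string value;

    // Each fragment is appended; the input buffer may be reused afterwards.
    void append(std::string_view fragment) { value.append(fragment); }
};

// jsonPerf style: the field only marks a region of one contiguous buffer.
struct PerfField {
    std::string_view value;

    // No allocation or copy; `buffer` must outlive the field.
    void mark(std::string_view buffer, std::size_t begin, std::size_t length) {
        value = buffer.substr(begin, length);
    }
};
```

FlowField pays for allocation and copying but accepts a value split across any number of fragments. PerfField is cheaper, but it requires the marked bytes to sit contiguously in a buffer that stays alive, which is why that mode needs the complete document.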

Cooperation

My collection of parsed protocols (via vProto) includes:

http1-2, https, ftp, imap, smtp, pop3, mime
bgp, igrp, ospf, netflow (v5-v10), ipfix, radius, diameter, snmp
sip, h323, iax2, sigtran, abis, ranap, megaco, mgcp, skinny, sdp
xmpp, jabber, msn, irc, yahoo, icq, mra
json, xml, asn1, bitTorrent

I am open to cooperation or participating in projects.