| Issue | Security and Safety, Volume 4, 2025: Security and Safety for Next Generation Industrial Systems |
|---|---|
| Article Number | 2025012 |
| Number of page(s) | 37 |
| Section | Industrial Control |
| DOI | https://doi.org/10.1051/sands/2025012 |
| Published online | 28 October 2025 |
Review
Reverse engineering of industrial control protocol: A survey
1 State Key Laboratory of Public Big Data, School of Computer Science and Technology, Guizhou University, Guiyang, 550025, China
2 College of Control Science and Engineering, Zhejiang University, Zhejiang, 310027, China
* Corresponding authors
Received: 11 April 2025
Revised: 7 June 2025
Accepted: 12 September 2025
Industrial control systems (ICSs) are designed for monitoring and controlling industrial processes, enabling the automation and management of critical sectors such as production, manufacturing, and power systems through electronic devices and communication infrastructure. Industrial control protocols (ICPs) are the standardized rules and formats used for this communication. Protocol reverse engineering (PRE) is the process of inferring the structure, semantics, and behavior of a communication protocol in the absence of official specifications or documentation. Given the prevalence of proprietary protocols in ICSs and the limited formal documentation, PRE is an important method for understanding and managing protocol behavior in complex, heterogeneous industrial environments. Over the past decades, ICSs have typically operated within isolated and closed network environments, where many protocol specifications remained proprietary and unknown, hindering the evaluation of protocol security. This limitation has driven the development of reverse engineering approaches for ICPs. To systematically summarize current research results and the development of ICP reverse engineering, we build a complete technical framework around the typical objectives of protocol reverse engineering. Existing methods are summarized in seven aspects: data acquisition, message clustering, field division, key field identification, field semantic derivation, state machine modeling, and application. Common problems and limitations are discussed, and finally, combined with future implementation needs, we propose several research directions worthy of attention.
Key words: Industrial Control Systems / Cybersecurity / Protocol Reverse Engineering
Citation: Wu Y, Zhang Z, Hetu Z, Cheng X and Cheng P. Reverse engineering of industrial control protocol: A survey. Security and Safety 2025; 4: 2025012. https://doi.org/10.1051/sands/2025012
© The Author(s) 2025. Published by EDP Sciences and China Science Publishing & Media Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Compared with conventional information systems, industrial control systems (ICSs) are characterized by stringent requirements for real-time performance and reliability [1]. Due to their frequent deployment in complex and decentralized industrial environments, their communication protocols and security mechanisms tend to exhibit a high degree of diversity and proprietary design. With the ongoing advancement of industrial digitalization and networking, ICSs are increasingly integrated into large-scale interconnected architectures, facing rising challenges in compatibility, security, and scalability [2].
Over the past few decades, ICSs have replaced bulky combinations of preset circuit-based control methods (comprising timers, counters, and relay switches) with more efficient systems capable of managing increasingly complex industrial processes [3]. An ICS is an integrated system for monitoring and controlling industrial production processes. It typically encompasses a multi-layered architecture consisting of sensors, actuators, programmable logic controllers (PLCs), remote terminal units (RTUs), and human-machine interfaces (HMIs). Its primary function is to collect and analyze critical parameters (such as temperature, pressure, and flow rate) from industrial environments in real time, and subsequently perform regulation and decision-making through automated or semi-automated means, thereby ensuring the efficiency and stability of production processes.
Since 1968, when Dick Morley designed the first PLC, the scale and complexity of ICSs have continued to grow, accompanied by increasing interconnectivity among them. To meet the demands of large-scale and continuous production, supervisory control and data acquisition (SCADA) systems, like distributed control systems (DCSs) [4, 5], have been widely deployed in geographically dispersed sectors such as electric power [6], oil pipelines [4], and water resources [7]. These systems rely on remote terminal units (RTUs) and communication links to enable real-time data acquisition and centralized management. However, the industrial control protocols (ICPs) they use remain largely proprietary and are primarily intended for internal industrial use [8, 9].
As of 2024, the global ICS market had reached USD 206.33 billion and is projected to continue expanding at a compound annual growth rate of 10.8% [10]. ICSs are now deeply integrated with enterprise networks, making it possible to leverage VPNs, mobile devices, and cloud platforms for remote monitoring, maintenance, and big data analytics. Open protocols such as OPC UA are increasingly used for cross-platform data exchange and information integration, emphasizing the use of big data, cloud computing, and artificial intelligence (AI) technologies to improve production efficiency and decision-making. As a result, concepts such as “smart manufacturing”, “Industrial Internet of Things (IIoT)”, and “Industry 4.0” have emerged [11].
Table 1. Glossary of abbreviations
1.1. Terminology and abbreviations
The terminology and abbreviations used in this paper are summarized in Table 1.
1.2. Security challenges
The rapid development of ICS has brought challenges in security protection, real-time performance, and reliability. The expansion of system scale and the diversification of equipment types have also made compatibility issues more prominent. As new technologies continue to intertwine with industrial environments, the focus of ICS protection has shifted accordingly.
In the past, a widely adopted and reliable security approach in ICS was to minimize system visibility by concealing device interfaces from external access and utilizing proprietary transmission media. However, as ICSs have become increasingly integrated with enterprise networks, the Internet of Things, and cloud platforms, their previously closed nature has been eroded, bringing significant network exposure risks [12]. Early ICSs were typically designed for relatively isolated environments and lacked comprehensive security mechanisms or encryption capabilities, rendering them defenseless against modern cyberattacks [13]. As these systems are connected to open networks, the attack surface expands [14], and intruders can attack through remote access, virtual private network (VPN), or firmware vulnerabilities [15]. The mix of old and new equipment and technologies from different manufacturers also leads to potential vulnerabilities in the system that are not fully protected or audited. Coupled with the high requirements for real-time operation and reliability, it is difficult to implement frequent patching or system upgrades, thereby increasing the risk of damage caused by external attacks.
Compatibility is another challenge brought about by the development of ICS. In practice, equipment from different manufacturers and eras differs in proprietary protocols, legacy communication interfaces [16], data formats, and even security mechanisms, which frequently causes obstacles such as unrecognized data and unexecuted instructions when systems are upgraded or interconnected [17]. Since ICSs have extremely high requirements for real-time operation and stability, protocol conversion or bridging adaptation layers may introduce additional delays and failure risks. In addition, many protocols lack complete documentation or public standards, and old and new equipment are more likely to conflict in implementation details, increasing the difficulty of system integration, operation, and maintenance. Once a compatibility failure occurs, it not only affects production efficiency but may also cause safety hazards. The main causes of such failures and deviations are an inadequate understanding of the protocol message structure, instruction set, and interaction timing, and the absence of targeted data format conversion and compatibility correction during adaptation. For instance, two devices may support the same protocol (e.g., Modbus) but adopt different register addressing schemes, field padding strategies, or byte order conventions, leading to misinterpretation of messages. In some cases, partial protocol implementations omit optional fields or commands, resulting in message parsing failures when communicating with fully featured counterparts. A typical scenario is a third-party HMI attempting to communicate with a PLC using a proprietary extension of a standard protocol: without access to the full protocol specification, the HMI may be unable to initiate valid queries or interpret responses correctly, causing silent failures or incorrect system behavior.
These compatibility barriers not only increase integration complexity but also limit the effectiveness of generic security monitoring and reverse engineering approaches, which often assume uniform protocol behavior. Therefore, addressing protocol-level incompatibilities is essential for ensuring reliable communication, effective monitoring, and scalable reverse engineering in complex ICS environments.
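The byte-order pitfall mentioned above can be shown concretely. In the sketch below (the register value is illustrative), the same two wire bytes decode to different numbers depending on which endianness convention the receiver assumes:

```python
import struct

# Two devices exchange a 16-bit Modbus-style register value as raw bytes.
# The sender encodes 0x0102; what the receiver reads depends entirely on
# the byte-order convention it assumes.
raw = struct.pack(">H", 0x0102)  # sender uses big-endian ("network order")

as_big = struct.unpack(">H", raw)[0]     # receiver assumes big-endian
as_little = struct.unpack("<H", raw)[0]  # receiver assumes little-endian

print(as_big, as_little)  # 258 513
```

A setpoint written as 258 but read back as 513 is exactly the kind of silent misinterpretation that undocumented conventions produce.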
Closed protocols refer to proprietary communication protocols whose message formats, field semantics, and operation logic are not publicly documented or standardized. Closed protocols are often developed by specific vendors for exclusive use in their hardware or software products. These protocols may involve obfuscated field structures, encrypted payloads, or undocumented control commands, making them inaccessible to third-party developers and researchers. In reverse engineering, this lack of transparency poses significant challenges, including difficulty in identifying field boundaries, reconstructing state machines, and interpreting control behaviors without access to source code or vendor documentation. In addition, closed protocols may contain dynamic field layouts or encode logic-dependent state transitions that cannot be reliably inferred from passive traffic alone. These characteristics require the development of protocol reasoning techniques that are resilient to unknown structures and adversarial analysis designs, increasing the importance of research into fully automated and assumption-light reverse engineering methods.
In the current ICS environment, the widespread use of closed-source and proprietary protocols poses considerable challenges to security. Undocumented or hidden fields may carry critical control parameters that have not been properly verified, increasing the risk of misuse or unauthorized operation; the lack of encryption or authentication mechanisms in communication protocols makes them vulnerable to eavesdropping, replay, or man-in-the-middle attacks. The overall lack of transparency in protocol logic hinders effective auditing, anomaly detection, or intrusion prevention. As a result, defenders cannot fully evaluate protocol behavior or develop general protection measures, while attackers can exploit these blind spots to reverse engineer protocol logic and design customized attacks with minimal resistance. Protocol reverse engineering is a key technical method to reveal hidden protocol structure and semantics, enabling better visibility, interoperability, and the development of active security mechanisms in closed industrial environments.
On the other hand, apart from external attacks and inter-system interconnection, ICSs also face internal risks generated by upgrades. Original industrial control equipment mostly operated in isolated environments, and transmission security was not considered a core element during design. The firmware of old PLCs or RTUs could not be updated in a timely manner, or lacked patch support, becoming potential vulnerabilities that are harder to detect and repair. Especially in industries with extremely long production cycles, companies often have to continue using systems or equipment that have been in service for many years. Even in the absence of obvious malicious motives, unpredictable process anomalies and safety accidents may occur due to improper operation by engineers [18]. It is worth noting that most ICSs, owing to their focus on continuous production and cost, prefer to maintain their original state and are unwilling to update security strategies frequently. This allows internal risks to accumulate during long-term operation, forming a state that appears stable on the surface but actually hides crises [19].
1.3. Significance and objectives of protocol reverse engineering
In the face of the above risks, protocol reverse engineering (PRE) is a viable method for understanding and protecting ICSs. It helps security researchers and engineers recover communication and protocol formats without official support, and implement behavior modeling, anomaly detection, and security hardening, thereby improving the security and efficiency of ICSs. Specifically, the main goals of protocol reversal are as follows:
- Restore the protocol format and state transition mechanism.
- Find out the meaning of protocol messages, function types, or tool chain sources.
- Analyze potential risks and unknown vulnerabilities of the protocol system.
- Detect and defend against errors and attacks in the operation of the protocol system.
In summary, protocol reverse engineering plays a positive role in enhancing the security and functionality of ICSs. It is an important method to address current industrial control security risks, enabling security researchers and engineers to better understand ICS and protect them from potential threats. All of this helps to improve the overall reliability and security of industrial operations. Protocol reverse engineering has become an important tool to ensure the stable and continuous operation of critical infrastructure in an increasingly interconnected and vulnerable network environment.
1.4. Contribution and paper structure
Therefore, in this paper, we conduct a survey to analyze and compare the existing studies about the reverse engineering of ICPs. To illustrate the differences between this article and the existing review literature, Table 2 provides a comparison with several representative surveys in this field [20–22].
Table 2. Comparison of our work and the existing work
As shown in Table 2, this paper differs from previous surveys by focusing specifically on ICPs rather than general-purpose communication protocols. Unlike previous work that emphasizes tool-based classification or general format and state machine reconstruction, our study proposes a fine-grained classification framework built around seven key technical goals in the reverse engineering process. In addition, we cover a wider range of recent literature up to 2025, including both traditional approaches and emerging ones based on machine learning, symbolic execution, and firmware analysis. This allows the survey to provide more actionable insights and research directions, especially for practitioners and researchers working in the specific context of ICSs.
In summary, the main contributions of this study are summarized as follows:
- The first survey of protocol reverse engineering focusing on ICPs, with the aim of highlighting the uniqueness of industrial control scenarios.
- A systematic analysis and comparison of existing research efforts based on seven objectives for the reverse engineering of ICPs.
- A deep discussion of the limitations of existing work, presented to promote a practical, extensible, and semantically rich reverse engineering framework for ICS protocols.
The remainder of this article is organized as follows. Section 1 introduces the challenges faced by ICSs; Section 2 introduces common industrial control protocol structures; Section 3 analyzes the characteristics and functions of protocol reverse engineering. Section 4 is the main body of this article: it constructs a technical framework covering the entire process of protocol reverse engineering and, on this basis, divides existing research into seven categories according to the specific dimension each focuses on, namely data acquisition and preprocessing, message clustering, protocol field division, key field identification, field semantic derivation, protocol state machine generation, and protocol reverse applications. For each stage, the core methods, representative work, technical advantages, and applicable scenarios are systematically summarized. Finally, based on the shortcomings of current research and the bottlenecks of existing work, this article proposes key research directions for future protocol reverse engineering in terms of adaptability to weak-assumption environments, depth of semantic recognition, and tool integration capabilities.
2. Characteristics of typical ICPs
In the reverse engineering of ICPs, typical ICPs exhibit characteristics that differ significantly from ordinary IT communication protocols. For example, they impose high requirements on real-time and uninterrupted transmission, and ICPs are typically binary, avoiding textual encodings for the sake of transmission efficiency [23]. These characteristics affect how protocol reverse engineering is carried out [24]. Given the limited data processing capabilities of the actual controllers and actuators in industrial control equipment, ICPs are designed to simplify the message structure as much as possible to reduce the computing pressure on the underlying firmware. The protocols most commonly used in industrial control environments include Modbus for general applications [25], DNP3 [26] and IEC 60870-5-104 [27] for remote communication, and S7Comm [28] and Profinet [29] for controller interconnection. Table 3 summarizes typical protocols in the industrial control field together with their application scenarios and characteristics.
Table 3. Overview of typical ICPs
Analyzing and summarizing the characteristics of typical ICPs helps to understand the purpose and method of ICP reversal. Next, this article takes the classic DNP3 protocol as an example to illustrate the typical structural composition characteristics of ICPs.
The DNP3 protocol is a multi-layer communication protocol suitable for firmware message transmission in industrial automation environments, and is commonly used in industries such as electricity, water conservancy, and transportation. It realizes functions such as data grouping, transmission verification, and link control, and ensures the accuracy of the transmitted content to a certain extent through CRC checks [30, 31]. The DNP3 protocol structure mainly consists of four parts: the data link layer, which defines the basic mechanism of communication between physical devices; the transport function, which marks long data packets and reassembles them at the receiving end; the application layer, which defines the instruction interaction structure between the master station and the slave station; and the data object library, which defines the various data types used in the protocol [32].
Figure 1 shows a typical DNP3 message structure. The first two bytes of the message, 05 and 64, are the fixed start field of a DNP3 message, marking the beginning of the frame. The third byte, 0A, indicates that the total length from the Link Control byte to the end of the link layer is 10 bytes. The fourth byte is the link control byte C4, which is expanded in Figure 2.
Figure 1. DNP3 message structure
Figure 2. Link control structure
The DIR and PRM bits are set to 1, indicating that the message is initiated by the master station and directed to the slave station; the CON bit is set to 1, indicating that the slave station must confirm receipt, while FIR and FIN are set to 0, indicating that the frame is not a segmented message. FUNC is set to 0, indicating that user data is sent (no response data), but since CON=1, the slave station is still required to confirm the frame. Returning to Figure 1, bytes 5 to 8 are the destination address and the source address, respectively, encoded in little-endian order. Bytes 9 and 10 are the link layer checksum (CRC), used to verify the correctness of the preceding fields. The above constitutes the Data Link Layer. The Transport Layer that follows is only 1 byte, describing packetization at the link layer. The Application Layer contains bytes 12 and 13, of which the APP control byte describes packetization at the application layer, and the Function Code byte carries the function code; here, 01 indicates that the message requests a read operation. Finally comes the Data Object Library: Group indicates the object type, such as Binary Input or Analog Output; Variation indicates how the object's data is represented (whether it has a timestamp, the data bit width, etc.); Qualifier indicates the structure of the subsequent data, such as whether it contains an index and the number of values; and Object Index and Object Data give the address and value of the field.
Through this field structure, DNP3 achieves unified modeling of different types of industrial control data (such as telemetry and control instructions). In protocol reversal, identifying these structures is crucial to understanding the semantics conveyed in the message.
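The link-layer walk-through above can be condensed into a small parser. The sketch below follows the 10-byte header layout described for Figure 1 and extracts the DIR/PRM bits and the low-nibble function code per the standard DNP3 link control layout; the frame bytes reuse the example values, with an illustrative (not recomputed) CRC:

```python
import struct

def parse_dnp3_link_header(frame: bytes) -> dict:
    """Parse the 10-byte DNP3 data link layer header described above."""
    start, length, ctrl = frame[0:2], frame[2], frame[3]
    if start != b"\x05\x64":
        raise ValueError("missing DNP3 start bytes 0x05 0x64")
    dest, src = struct.unpack_from("<HH", frame, 4)  # little-endian addresses
    crc = struct.unpack_from("<H", frame, 8)[0]      # link-layer checksum
    return {
        "length": length,
        "dir": (ctrl >> 7) & 1,  # 1: frame travels master -> slave
        "prm": (ctrl >> 6) & 1,  # 1: frame from the primary (initiating) side
        "func": ctrl & 0x0F,     # link-layer function code (low nibble)
        "dest": dest,
        "src": src,
        "crc": crc,
    }

# Header bytes matching the example above (CRC value is illustrative).
header = parse_dnp3_link_header(bytes.fromhex("05640ac401000200ffff"))
print(header["dir"], header["prm"], header["dest"], header["src"])  # 1 1 1 2
```

Note that a conformant implementation would also verify the CRC and handle the transport and application layers; this sketch stops at the link header.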
3. Characteristics of ICS
Conventional network security tools and methods for protecting information systems can play a certain role in addressing ICS security risks, but due to the special nature of ICSs, these methods may not achieve the desired effect, mainly because of the following characteristics.
Firstly, industrial control equipment is highly coupled with the production process. Any shutdown operation, traffic interruption or hardware replacement may have a significant impact on the overall production process, and even lead to economic losses or production accidents [33]. Therefore, frequent use of traditional security strategies (such as deploying firewalls, batch patch updates or long-term scanning) is often not feasible, and security protection is difficult to achieve the expected effect in such an environment [34].
Secondly, ICS equipment was limited by cost and scale at the beginning of its design, and its support for high-load computing or complex algorithms is relatively limited. As the service time of the equipment increases, the performance of key components gradually ages, and the processing capacity is further weakened, making it difficult to support the continuous operation of high-intensity security controls such as deep traffic analysis and dynamic behavior detection [35].
Thirdly, at industrial production sites, especially when handling sudden safety accidents or emergencies, operators need to issue instructions that the system accepts immediately; overly complex authentication or authorization schemes may be harmful in such situations [36].
When traditional security means struggle to cope with the many limitations exposed in ICSs, protocol reverse engineering becomes a necessary way to deeply explore potential vulnerabilities and improve the overall protection level of the system [37]. Through reverse engineering, we can understand the communication processes, instruction sets, and verification methods of various proprietary or legacy protocols, uncover potential weaknesses such as plaintext transmission, insufficient authentication mechanisms, or absent security verification, and provide a basis for targeted reinforcement or updating of security modules. The accurate data formats and timing characteristics revealed by protocol reverse engineering can help security tools (such as intrusion detection systems and vulnerability scanners) formulate more precise monitoring and protection strategies and identify abnormal data packets and malicious instructions [38]. At the same time, for configuration vulnerabilities and compatibility conflicts caused by missing or incomplete documentation, protocol reversal enables R&D and operations teams to efficiently master the implementation details of the protocol, thereby reducing security risks caused by misconfiguration [39]. In general, protocol reversal provides a deeper and more comprehensive technical foundation for ICS security: it not only drives vulnerability repair and security enhancement, but also improves control over the overall security posture of the system.
4. ICP reverse framework
The process of common ICP reverse methods follows a regular pattern: first obtaining and preprocessing the data source, then clustering messages according to certain rules, dividing protocol fields, identifying key fields, deducing field semantics, generating the protocol state machine, and finally verifying and fuzz testing the protocol behavior [40, 41]. An ICP reverse engineering method usually focuses on one dimension of this process, proposing innovative or optimized techniques for its type of target data source [42]. Figure 3 summarizes common data sources for protocol reversal and some analysis methods corresponding to these data sources.
Figure 3. Overview of ICP reverse engineering
This paper summarizes the technical route of ICP reverse engineering according to the reverse dimension each method focuses on, as shown in Figure 4. The starting point of reverse work is to determine the analysis object and data source. Intercepting network transmission messages is direct and effective, but the control logic and system log records contained in program execution code can also provide accurate information [43, 44]. These data constitute the basic input of protocol reverse engineering. After preprocessing and message clustering, these inputs enter the stage of deciphering the protocol content. Field segmentation is usually the first step in this stage: the system divides bytes into fields by analyzing and comparing features contained in the data, or by other heuristic methods. Among them, the key fields that most significantly affect the protocol receive the most attention. The reverse analysis framework then infers the meaning of these fields through methods such as byte-level mutation analysis, identifying the field types commonly included in various protocols, such as function codes, addresses, and data payloads. After obtaining the structure and meaning of the fields, the interaction logic and state transition relationships of the protocol can be analyzed, which is also called generating a protocol state machine [45]. Ultimately, the results of protocol reverse engineering are put to practical uses that benefit ICS security, such as protocol consistency checking and ICS security protection [46–49].
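As a toy illustration of the field-segmentation step, the sketch below uses a generic variability heuristic (assumed here for illustration, not any specific published method): byte offsets whose values stay constant across aligned messages suggest fixed header fields, highly variable offsets suggest payload or counter bytes, and points where constancy flips are candidate field boundaries.

```python
def byte_variability(messages: list[bytes]) -> list[int]:
    """Count distinct values observed at each byte offset across messages."""
    width = min(len(m) for m in messages)
    return [len({m[i] for m in messages}) for i in range(width)]

def boundary_candidates(messages: list[bytes]) -> list[int]:
    """Flag offsets where constancy flips: likely field boundaries."""
    var = byte_variability(messages)
    return [i for i in range(1, len(var)) if (var[i] == 1) != (var[i - 1] == 1)]

# Synthetic traffic: fixed 2-byte header, varying 1-byte function code,
# fixed 1-byte separator, varying 1-byte payload.
msgs = [bytes([0x05, 0x64, fc, 0x00, data]) for fc, data in
        [(1, 10), (2, 20), (1, 30), (3, 40)]]
print(byte_variability(msgs))     # [1, 1, 3, 1, 4]
print(boundary_candidates(msgs))  # [2, 3, 4]
```

Real systems refine this idea with alignment, entropy measures, and semantic constraints; the point here is only that statistical features over many messages can reveal structure without any specification.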
Figure 4. The framework for ICP reverse engineering
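The state-machine-generation step can likewise be sketched in miniature. Given message-type sequences recovered from captured sessions (the session data below is hypothetical), the observed successor relation already forms a crude protocol state machine:

```python
from collections import defaultdict

def infer_state_machine(sessions: list[list[str]]) -> dict:
    """Build a transition map from per-session message-type sequences.

    Each session is the ordered list of message types seen on one connection;
    the inferred 'state machine' here is simply the set of observed successors
    of each message type (a simplified stand-in for full state inference).
    """
    transitions = defaultdict(set)
    for session in sessions:
        for cur, nxt in zip(["INIT"] + session, session):
            transitions[cur].add(nxt)
    return dict(transitions)

# Hypothetical sessions of a request/response protocol.
sessions = [
    ["CONNECT", "READ_REQ", "READ_RESP", "DISCONNECT"],
    ["CONNECT", "WRITE_REQ", "WRITE_RESP", "DISCONNECT"],
]
fsm = infer_state_machine(sessions)
print(sorted(fsm["INIT"]))     # ['CONNECT']
print(sorted(fsm["CONNECT"]))  # ['READ_REQ', 'WRITE_REQ']
```

Practical systems go further, merging equivalent states and attaching guard conditions, but even this successor map supports consistency checks such as flagging a WRITE_REQ that arrives before CONNECT.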
4.1. Data acquisition and preprocessing
Data acquisition and preprocessing are the first step in protocol reverse engineering, which aims to collect protocol interaction data that can be used for analysis, and provide a data source and analysis basis for subsequent work. There are two main types of data sources for ICPs. One is the network communication traffic for industrial equipment interactions during protocol operation, and the other is static data such as protocol implementation programs or program operation records. For different data sources, appropriate extraction methods should be used and certain preprocessing should be performed to convert them into data forms suitable for further analysis, thereby achieving the integration and construction of high-quality protocol reverse data sources.
4.1.1. Network traffic data
The traffic data of ICP is usually transmitted through industrial networks. Passive traffic capture is a way to obtain industrial network traffic data. This method captures the original data packets of the protocol through traffic capture tools such as Wireshark and tcpdump, or uses dedicated hardware devices to physically access the network link to export target data without interfering with the normal operation of the ICS. The communication data of the captured industrial equipment can be further used for subsequent protocol analysis, state modeling, security detection and other purposes.
In this regard, Marcin Nawrocki et al. conducted large-scale long-term Internet mapping and passive traffic monitoring through IXP and ISP traffic mirroring, identifying ICS devices exposed to the Internet worldwide, obtaining continuous PCAP format traffic data for up to 6 months, and using heuristic methods such as port numbers to screen out periodic and stable session traffic as industrial production traffic [50]. This can be further used in industrial control security research. This method obtains a large amount of analysis data entirely through passive traffic acquisition and deploying passive probes at traffic exchange nodes, demonstrating the advantages of passive detection in terms of concealment and obtaining real data, while also revealing the network exposure risks of a large number of ICS devices.
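Captured traffic such as the PCAP data mentioned above follows the well-known libpcap savefile layout. A minimal reader for the 24-byte global header, demonstrated here on a synthetic header rather than a real capture, can be sketched as:

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # microsecond-resolution pcap, little-endian here

def parse_pcap_global_header(buf: bytes) -> dict:
    """Parse the 24-byte libpcap global header (classic pcap, not pcapng)."""
    magic, major, minor, _tz, _sig, snaplen, linktype = struct.unpack(
        "<IHHiIII", buf[:24])
    if magic != PCAP_MAGIC:
        raise ValueError("not a little-endian microsecond pcap file")
    return {"version": (major, minor), "snaplen": snaplen, "linktype": linktype}

# Synthetic header: pcap v2.4, snaplen 65535, linktype 1 (Ethernet).
hdr = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
info = parse_pcap_global_header(hdr)
print(info)  # {'version': (2, 4), 'snaplen': 65535, 'linktype': 1}
```

In practice one would use an existing library rather than hand-rolling a reader; the sketch only shows that passively collected traffic arrives in a simple, well-documented container that preprocessing stages can iterate over packet by packet.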
Passive monitoring and capture can provide protocol interaction data from real environments on a larger scale, with greater concealment and flexibility, and with little prior knowledge of protocol formats. However, passively captured traffic from complex ICSs may lack complete protocol coverage, and long-term monitoring is required to obtain comprehensive protocol samples. Actively constructed interaction tests, by contrast, obtain the required server responses by sending targeted, constructed data packets to ICS devices, yielding more targeted protocol interaction data. To this end, Zhengxiong Luo et al. proposed a protocol reverse engineering method based on dynamic inference (DYNPRE), which communicates with the target protocol server through active interaction to obtain richer protocol information and improve parsing accuracy [51]. The method learns interaction rules from initial traffic, such as identifying mutable fields like session IDs and tokens that must be updated to keep messages valid. DYNPRE then constructs messages that conform to the known protocol specifics through byte-level micro-mutations, continuously communicates with the server while observing its responses, and iteratively updates its message modification strategy. As a result, DYNPRE obtains richer protocol information, breaks through the limitations of relying solely on static traffic analysis, and improves the accuracy and automation of protocol parsing.
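The byte-level probing idea behind such active approaches can be sketched abstractly: mutate one byte of a seed message at a time, replay it against a responder, and mark positions whose mutation changes the response as semantically significant. The responder below is a toy stand-in for a server, not a real ICS endpoint, and this simplified loop is illustrative rather than DYNPRE's actual algorithm:

```python
def probe_significant_bytes(seed: bytes, responder) -> list[int]:
    """Flip one byte at a time; record offsets where the response changes."""
    baseline = responder(seed)
    significant = []
    for i in range(len(seed)):
        mutated = bytearray(seed)
        mutated[i] ^= 0xFF  # single-byte mutation
        if responder(bytes(mutated)) != baseline:
            significant.append(i)
    return significant

# Toy responder: accepts messages whose first byte is a known function code
# and whose second byte (a session id) matches; ignores the trailing payload.
def toy_responder(msg: bytes) -> bytes:
    if msg[0] == 0x01 and msg[1] == 0x42:
        return b"OK"
    return b"ERR"

print(probe_significant_bytes(b"\x01\x42\x00\x00", toy_responder))  # [0, 1]
```

Here the probe correctly singles out the function-code and session-id positions while the payload bytes are classified as insensitive, which is the kind of field-significance signal active interaction can extract.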
The dynamic inference strategy driven by active interaction can compensate for the inherent limitations of traffic-based protocol reverse engineering: insufficient sample coverage, missing state information, and weak field semantic inference. However, while active interaction provides a more detailed and targeted data source, sending carefully crafted or mutated messages to devices without a full understanding of the protocol semantics can trigger unexpected actions, such as changing setpoints, initiating emergency states, or even halting the production process. In addition, real-world ICS environments often impose strict timing, safety interlocks, and proprietary behaviors, where unexpected interactions can cause equipment failure or process instability. Active testing must therefore be used with caution, preferably in a sandbox, simulation, or digital twin environment, to avoid disrupting physical operations or compromising safety.
Digital twins play a positive role in overcoming the high security constraints of real-world industrial systems, limited device access, and potential risks of system disruption during reverse engineering. A digital twin refers to a high-fidelity virtual representation of a physical industrial system that is able to replicate its control logic, communication behavior, and operational dynamics in real time. By integrating protocol interaction models and device simulators into simulation frameworks, reverse engineering experiments can be conducted safely, repeatably, and at scale without interfering with the actual production environment. These platforms allow for the injection of malformed messages, monitoring of semantic effects, and dynamic observation of control flows, enabling behavior-based protocol reasoning, anomaly detection, and fuzz testing under controlled conditions. In addition, the simulation environment can be configured to mirror a variety of deployment scenarios, including network topologies, timing constraints, and fault conditions, which significantly enhances the transferability and robustness of reverse engineering approaches. Future work should explore how to combine reverse engineering tools with digital twin platforms and investigate how simulation feedback loops can assist in protocol state modeling, field semantic derivation, and verification of reverse outputs.
There is a special type of traffic in ICP traffic data: the control logic program transmitted between the PLC and the host computer. This program is written by engineers or PLC manufacturers on the engineering workstation (EWS) and then converted from source code into assembly code and binary code by a dedicated compiler and assembler. Such traffic carries the "behavior description" of the PLC device and can be restored to assembly instructions, or to more readable high-level programming languages, by decompilation, thereby revealing the device's true behavior logic and supporting security analysis in protocol reverse engineering. Manufacturers usually provide dedicated decompilers for their own PLCs, and general-purpose decompilers such as IDA Pro and Ghidra can also, to a certain extent, handle the format differences and the traffic identification and adaptation problems across manufacturers.
Focusing on decompilation, Anastasis Keliris et al. proposed the ICSREF framework for automatically analyzing PLC binary files created with CODESYS, one of the three major software platforms for ICS [52]. The framework can automatically and completely reconstruct the control flow graph (CFG) of any given CODESYS binary. This manufacturer-specific decompilation and reverse engineering framework performs well, achieving 100% accuracy on its self-built PLC program database and significantly outperforming general-purpose decompilers such as Hex-Rays and RetDec.
ICSREF demonstrates excellent reverse-analysis capability in its applicable scenarios and provides useful references and methodological inspiration for subsequent PLC decompilation work. However, it mainly targets Schneider Modicon PLCs, and its cross-vendor compatibility needs to be expanded. Its format extraction relies on recognizing specific offsets, symbol patterns, and segment structures, and is therefore highly dependent on the stability of the binary file format. In addition, ICSREF takes control application binaries compiled by CODESYS2 engineering software as its analysis object and treats the availability of this input data as a prerequisite: its work focuses on static analysis and semantic reconstruction of an already-acquired binary, and the data extraction process is not part of the method. In real ICS environments, the control logic program is usually encapsulated in a proprietary format inside the PLC device, and acquiring it involves the upload function of the engineering software, device interface access, or firmware-level data extraction, all of which have technical thresholds and applicability limitations. This presupposition means that, in practice, the method depends heavily on the user's ability to obtain the PLC control application binary directly.
In view of the shortcomings of related work in this direction, Syed Ali Qasim et al. proposed Reditus, a control logic forensics framework built on the decompilation capability of engineering software, which automatically recovers PLC control logic source code from ICS network traffic [53]. The framework obtains the binary control logic by passively intercepting network traffic (mainly messages transmitted from engineering software to the PLC), determines the PLC manufacturer from known protocol features, and converts the messages into human-readable high-level control logic source code using the decompilation function built into the corresponding manufacturer's engineering software. The recovered source code is then audited and forensically analyzed to identify potential malicious modifications or attack behaviors.
This method assumes that the manufacturer can be correctly determined by comparing features such as the default port or iconic sequence of the protocol, and uses the manufacturer’s existing decompiler as a data processing method. Reditus itself does not directly parse private formats but uses the capabilities of engineering software to avoid parsing details. This reliance on the manufacturer ecosystem reduces the robustness of the method to a certain extent, and may not be able to cope well with PLC firmware upgrades, compiler updates, etc. It needs to be further improved in terms of versatility.
In further research, Yangyang Geng et al. proposed the CLADF framework, a solution that can detect potential control logic attacks and perform forensics and backup recovery while the device is running [54]. To automatically obtain the control logic program code from the PLC, CLADF first performs an upload operation and captures the network packets between the programming software and the PLC, simulates the upload by replaying the request sequence to obtain the target binary file, and uses a general disassembly tool to generate a standardized assembly log file. Compared with Reditus, which aims to restore readable source code, CLADF only produces assembly-level instructions. The extracted instruction sequence is then compared with a known-good version to locate tampered, inserted, or deleted subroutines.
This method still requires binary format adaptation and customized development for different PLC manufacturers, and relies on the ability to obtain a trusted correct version file as a comparison benchmark. Disassembly cannot restore high-level control logic (such as ladder diagrams, ST language), and can only be used for difference detection. However, its feature of not pursuing source code level restoration makes it independent of the decompilation capabilities of specific engineering software, and improves the independence and robustness of the tool.
4.1.2. Firmware extraction and log analysis
Reverse engineering that takes protocol implementation programs or program operation records as data sources mainly analyzes static data stored inside industrial control devices (such as PLCs, RTUs, and HMIs). Implementation program code usually contains core logic such as protocol parsing, data interaction, state transition, and security mechanisms; by restoring its structure and behavior, the concrete implementation of the protocol at the device level can be revealed. For example, the control logic program handled by the aforementioned decompilation methods is essentially the control application that the device itself runs, transferred when PLC firmware is uploaded or downloaded. This type of data source contains key logic and semantic information of the protocol implementation, providing important data support for protocol reverse analysis, behavior modeling, and security verification, and is an important entry point for reverse analysis of control logic. Log information contains detailed operating instructions, system events, and protocol interaction records of industrial control equipment, including key details such as timestamps, command types, and response status; parsing this content reveals the implementation logic and data interaction methods in the firmware. However, obtaining such data sources (program code and log information) usually requires access to specific devices or work planes, which is more difficult than obtaining network traffic.
Obtaining firmware data by accessing memory through device interfaces is a feasible approach. Jonas Zaddach et al. extracted binary data such as the boot firmware of a hard disk drive (HDD) and the firmware of a wireless sensor node (a Zigbee device) when implementing the Avatar framework for dynamic analysis of embedded devices [55]. This process uses the JTAG interface and serial ports to read code data from storage units such as RAM and Flash. For devices that do not support these methods, Avatar further uses dedicated devices such as CH34A, or combines with a QEMU virtual machine, to extract data.
This method extracts firmware code data by directly accessing the device storage media (such as Flash, EEPROM, etc.), thereby realizing high-precision direct export of the internal program of the device, which can fully retain key information such as protocol implementation, data parsing and control logic, and has higher reliability and analysis depth. However, this method usually relies on actual contact and operation of the target physical device, requires the ability to connect and extract data from the hardware interface, and may face practical challenges such as device packaging, encryption protection and limited access rights in some scenarios. Therefore, although this method has obvious advantages in obtaining data accuracy and completeness, its applicability is still limited in large-scale applications and non-invasive analysis scenarios.
In protocol reverse engineering research in the field of firmware and log analysis, authors usually assume that data sources can be obtained through direct export or through manufacturer release [56]. This availability assumption facilitates method design, allowing research work to focus on core technical aspects such as protocol restoration, logic reconstruction, and security analysis. In actual application environments, due to the limitations of manufacturer-proprietary formats, storage encryption, and access permission control, directly exporting firmware data requires professional means such as device disassembly and hardware interface access (such as JTAG, UART, SPI Flash Dump). Although PLC manufacturers will release firmware update packages or maintenance logs, their real-time nature and consistency with the actual operating environment do not always meet the needs of reverse analysis, especially when faced with customized devices, older models, or private protocols that are not publicly documented.
4.2. Message clustering
In the reverse engineering of ICPs, message clustering refers to the automatic grouping of protocol data sources based on the similarity of data structure and content characteristics, and the automatic division of original protocol data sources into message categories with consistent characteristics, so as to identify the commonalities of the same message types in the protocol (such as read requests, write requests, responses, exception notifications, etc.). This process helps to reveal the format variation and field distribution characteristics of protocol messages, and is also an important basis for subsequent core analysis tasks such as field division, semantic derivation, and state machine modeling.
Since ICPs (especially private protocols customized by manufacturers) often lack public documentation, and their data sources show significant diversity and opacity in message types, field layouts, and semantic encoding, parsing and classifying messages one by one manually is inefficient and cannot meet the analysis needs of large-scale traffic data. Automated message clustering can instead analyze data in batches without prior knowledge, uncover latent structural patterns in protocol data sources, and reduce the amount of manual intervention in reverse analysis. At the same time, given the fixed formats of industrial protocols, their limited number of message types, and their relatively stable field positions, message clustering can effectively avoid the performance bottleneck and limited generalization of traditional comparison-based protocol reverse engineering methods on large volumes of heterogeneous data. Dividing messages into clusters improves the accuracy and reduces the overhead of field extraction and semantic inference, and also provides a structured data foundation for protocol state modeling and fuzz-testing input generation, making it an indispensable link in the protocol reverse-analysis chain.
In the reverse engineering of ICPs, message clustering is a prerequisite for protocol format restoration and semantic deduction. Its method system can be systematically divided from the perspective of clustering strategy. Combined with the existing work in this field, this paper summarizes the message clustering methods in ICP reverse engineering into two categories: hierarchical clustering methods and model-driven clustering methods, which represent clustering methods based on explicit similarity measurement and implicit probability modeling, respectively.
4.2.1. Hierarchical clustering
Hierarchical clustering methods achieve hierarchical clustering of protocol message samples by merging layer by layer from bottom to top or subdividing layer by layer from top to bottom. This type of method is usually based on byte sequence similarity, field pattern matching or feature vector distance, and clusters the protocol messages according to the explicit differences in structural features. Represented by the traditional UPGMA method in Netzob, hierarchical clustering can make full use of the domain characteristics of industrial protocols with relatively fixed message formats and clear type boundaries to achieve efficient and intuitive message type division [57]. In contrast, when faced with complex private protocols with frequent field reuse and flexible and changeable formats, the clustering accuracy and generalization ability of this type of method are still limited by the ability to express features.
When faced with industrial protocol characteristics, general protocol reverse engineering tools (such as Netzob and AutoReEngine) often fall short in both clustering quality and computational performance. In response, Kyu-Seok Shim proposed an automatic reverse engineering method for ICP structure analysis, whose core is a two-stage message clustering strategy: coarse-grained clustering by message length followed by fine-grained clustering by message similarity [58]. The author observed that messages of the same ICP type usually have a fixed length, so messages with significantly different lengths are first quickly placed in different clusters to reduce the complexity of subsequent computation. The Mean-Shift clustering algorithm is then used to finely cluster similar messages based on inter-message similarity and identify the different protocol message types. The clustering results are finally fed into the CSP algorithm to extract static fields and message structure segmentation, and the temporal logic between messages is restored through dialogue analysis.
The above method uses Mean-Shift for two-stage hierarchical clustering without presetting the number of clusters or distance thresholds. This method can adapt to clusters of any shape and distribution. Its clustering result is essentially the natural division of density peaks in the data space, which has intuitive physical meaning and better interpretability. Compared with K-means, UPGMA and other methods, it is more suitable for processing private protocols with unknown types and complex formats. However, its computational overhead and sensitivity to bandwidth parameters restrict its wider application.
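The two-stage idea can be sketched in a few lines. The sketch below groups messages coarsely by length and then refines each group by byte similarity; the threshold-based merge in stage 2 is a deliberately simple stand-in for Mean-Shift, and the sample messages are invented.

```python
from collections import defaultdict

def hamming(a: bytes, b: bytes) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)

def two_stage_cluster(messages, max_dist=2):
    """Stage 1: coarse grouping by message length (same-type ICP messages
    tend to have a fixed length). Stage 2: fine grouping by byte similarity,
    a simple threshold stand-in for the Mean-Shift step."""
    by_len = defaultdict(list)
    for m in messages:
        by_len[len(m)].append(m)
    clusters = []
    for group in by_len.values():
        for m in group:
            for c in clusters:
                if len(c[0]) == len(m) and hamming(c[0], m) <= max_dist:
                    c.append(m)
                    break
            else:
                clusters.append([m])
    return clusters

msgs = [b"\x01\x03\x00\x10", b"\x01\x03\x00\x11",   # read-like requests
        b"\x01\x10\xff\x00", b"\x01\x10\xff\x01",   # write-like requests
        b"\x01\x03\x02\x00\x0a"]                    # a longer response
print(len(two_stage_cluster(msgs)))  # → 3
```

The length pre-filter means the expensive similarity comparison only runs within same-length groups, which is exactly where the method gains its efficiency.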
Binary protocols pose difficulties both for extracting message features and for determining the number of clusters. Di Tong et al. proposed a density-peak-based ICP message clustering method to improve the accuracy of reverse format extraction and state machine derivation [59]. The authors used n-gram techniques from natural language processing to construct features of ICPs and built feature vectors of ICP data messages from the generated n-gram fragments. An improved density peak clustering algorithm (DPC) was then used to avoid the difficulty traditional methods have in initializing the number of clusters and the cluster centers, thereby achieving automatic clustering of ICP messages. Unlike traditional clustering methods such as K-means, DPC does not require a preset number of clusters but automatically identifies cluster centers through local density and relative distance, enabling more accurate classification of protocol messages.
The above-mentioned Density Peak Clustering method establishes a hierarchical dependency relationship from “point-point” to “cluster-cluster” by calculating the local density of each point and the minimum distance to a higher density point. After identifying the cluster center, the remaining samples are gradually assigned to the corresponding center through the propagation of density gradients. This “center-slave” relationship constitutes an implicit hierarchical tree structure. Compared with traditional flat clustering, it can provide stronger feature expression ability and class number adaptability, and perform better when processing binary protocols.
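A minimal, pure-Python sketch of the density-peak idea over n-gram features follows. The Gaussian-kernel density, the Jaccard distance over 2-gram sets, and the five sample messages are all illustrative choices, not the paper's exact parameters; the point is that points with both high density (rho) and high distance to any denser point (delta) emerge as cluster centers without presetting a cluster count.

```python
import math

def ngrams(msg: bytes, n: int = 2):
    return {msg[i:i + n] for i in range(len(msg) - n + 1)}

def jaccard_dist(a: set, b: set) -> float:
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0

def rank_centers(messages, d_c: float = 0.5):
    """Density-peak ranking: rho is a Gaussian-kernel local density,
    delta the distance to the nearest higher-density point; cluster
    centers are the points with the largest rho * delta."""
    feats = [ngrams(m) for m in messages]
    k = len(feats)
    dist = [[jaccard_dist(feats[i], feats[j]) for j in range(k)] for i in range(k)]
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(k) if j != i)
           for i in range(k)]
    delta = [min((dist[i][j] for j in range(k) if rho[j] > rho[i]),
                 default=max(dist[i])) for i in range(k)]
    return sorted(range(k), key=lambda i: -(rho[i] * delta[i]))

# Invented samples: three similar "read" messages, two similar "write" messages
msgs = [b"\x01\x03AAAA", b"\x01\x03AAAB", b"\x01\x03AABA",
        b"\x02\x10ZZZZ", b"\x02\x10ZZZY"]
print(rank_centers(msgs)[0])  # → 1 (the densest "read"-type message)
```
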
In the hierarchical clustering of ICPs, the computational overhead of the reverse process is a key factor affecting system scalability. Methods such as UPGMA or density peak clustering require computing the distance matrix over all data points, an O(n²) cost that becomes a significant efficiency bottleneck when processing large-scale protocol message data sets (thousands to tens of thousands of traffic messages). To address this problem, Yukai Ji et al. proposed IMCSA, a clustering model with lower computational overhead [60]. By introducing a keyword-driven pre-clustering mechanism, the sequence alignment space is effectively reduced, significantly improving overall processing efficiency while preserving clustering accuracy.
IMCSA proposes an improved Bag-of-Words Generation (BWG) algorithm to automatically extract field segments with high-frequency occurrence and low information entropy features as keywords of protocol messages. In the feature construction process, the algorithm introduces forward and backward offsets to alleviate the feature mismatch problem caused by field lengthening or structural misalignment. Subsequently, IMCSA constructs a sparse Bag-of-Words feature vector based on keywords and performs a pre-clustering to classify messages with similar structures into the same cluster. After this pre-clustering process, the computational complexity of the subsequent multiple sequence alignment (MSA) is reduced from the global O(n²) to the alignment operation within the local cluster, which significantly reduces the alignment space and improves the computational efficiency and scalability of the overall protocol reverse process. Through this strategy, IMCSA realizes a computational paradigm shift from "full sample alignment" to "intra-cluster alignment", while ensuring clustering accuracy, it alleviates the performance bottleneck of sequence alignment in protocol reverse engineering to a considerable extent, and makes up for the disadvantage of hierarchical clustering methods in computational overhead.
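The scale of this reduction is easy to quantify. The arithmetic below, using invented sample sizes, compares the number of pairwise alignment operations before and after a pre-clustering step that splits n messages into equally sized clusters:

```python
def pairwise_ops(cluster_sizes):
    """Number of pairwise alignment operations: s*(s-1)/2 per cluster."""
    return sum(s * (s - 1) // 2 for s in cluster_sizes)

full = pairwise_ops([10_000])        # global alignment over all messages
pre = pairwise_ops([2_000] * 5)      # alignment within 5 pre-clusters
print(full, pre, round(full / pre, 2))  # → 49995000 9995000 5.0
```

Splitting into k balanced clusters cuts the pairwise work by roughly a factor of k, which is where IMCSA's speedup over full-sample MSA comes from.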
4.2.2. Model-driven clustering
Model-driven clustering refers to clustering analysis of protocol messages from a higher-level semantic feature space through methods such as probabilistic modeling and latent variable inference. Typical methods include Type-Aware clustering based on Latent Dirichlet Allocation (LDA) topic modeling, symbolic execution path-guided clustering based on message field coverage optimization, etc. This type of method breaks through the limitation of relying solely on byte-level similarity. It can mine potential type associations and structural differences between messages without setting the number of classes a priori, and adapt to the actual needs of complex industrial protocols with mixed field semantics and frequent format variations. Although model-driven clustering has stronger expressiveness and adaptability, its analysis process places higher requirements on algorithm convergence and computing resources, and still needs to balance optimization between engineering efficiency and accuracy.
To make up for the shortcomings of the above methods and improve clustering accuracy, Xin Luo et al. proposed a type-aware message clustering method based on LDA topic modeling [61]. LDA was originally a model for discovering the "document-topic-word" relationship in text: by observing word co-occurrence patterns across a large number of documents, it infers the implicit probabilistic relationships between documents and topics, and between topics and words, revealing which topics prefer which words and which topic components each document contains. This process rests on Bayesian inference and achieves unsupervised topic discovery without manual annotation. The method in this paper regards protocol messages as "documents", implicit message types as "topics", n-gram subsequences as "words", and type information as the probability that a message belongs to a given type. The association between messages and implicit types is inferred by LDA, thereby producing a semantic hierarchy of messages.
Unlike traditional clustering methods based on global sequence alignment or high-frequency field matching, LDA modeling can capture the potential connection between field co-occurrence and type features, and enhance the ability to express complex protocol features such as field reuse and flexible format. Based on the similarity measurement of type distribution, the clustering process not only considers the surface features at the byte level, but also incorporates high-level semantic information, thereby improving the clustering accuracy and the interpretability of the results.
However, it is worth noting that the bag-of-words assumption of LDA ignores the sequence order information. The message is regarded as an unordered set of n-grams in this method. The sequence dependency and structural hierarchy relationship between fields may be ignored. In protocols with strong sequence dependency (such as protocols with strict state transitions and sensitive field order), the clustering effect may be limited. At the same time, the n-gram feature space of protocol messages is often high-dimensional and sparse, resulting in high computational complexity in the inference process of the LDA model, especially in large-scale traffic data scenarios, where computing resources and time consumption are obvious. In addition, LDA relies on a large number of samples for statistical modeling. When the data scale is small and the samples are sparse, it is difficult for the model to fully learn the type-field co-occurrence relationship, and the clustering effect may decline.
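The "message as document, n-gram as word" mapping can be made concrete by building the document-term matrix that a topic model would consume. The sketch below is only the feature-construction step, with invented sample messages; an actual LDA implementation (for example, scikit-learn's LatentDirichletAllocation) would then be fitted on the matrix `X`.

```python
from collections import Counter

def ngram_tokens(msg: bytes, n: int = 2):
    """Each n-byte subsequence is a 'word' of the message 'document'."""
    return [msg[i:i + n].hex() for i in range(len(msg) - n + 1)]

def doc_term_matrix(messages, n: int = 2):
    vocab = sorted({t for m in messages for t in ngram_tokens(m, n)})
    rows = []
    for m in messages:
        counts = Counter(ngram_tokens(m, n))
        rows.append([counts.get(t, 0) for t in vocab])
    return vocab, rows

# Invented samples: two structurally similar messages and one different one
msgs = [b"\x01\x03\x00\x10", b"\x01\x03\x00\x20", b"\x01\x10\xaa\xbb"]
vocab, X = doc_term_matrix(msgs)
print(len(vocab), X[0])  # → 7 [1, 0, 1, 0, 1, 0, 0]
```

Note how the bag-of-words view already discards n-gram order, which is exactly the limitation for order-sensitive protocols discussed above.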
In protocol reverse engineering, whether the sample data fully covers all message types and field features defined in the protocol specification is a key factor that directly affects the message clustering effect and the accuracy of protocol format restoration. In practice, however, it is unrealistic to capture network messages covering every field defined in the specification. Limited by the scope of traffic collection, the coverage of business scenarios, and the usage frequency of the protocol, analysts often find it difficult to obtain large-scale, high-quality traffic samples that cover all fields and boundary conditions. Insufficient sample coverage inevitably degrades the type-distinction and format-inference accuracy of sample-driven clustering methods, further affecting the completeness and effectiveness of protocol reverse engineering.
To alleviate the strong dependence of existing protocol reverse engineering methods on high-quality traffic samples, Yue Sun et al. proposed Spenny, a protocol reverse-analysis framework based on symbolic execution [62]. Through program behavior modeling, "field coverage" is used as a path-priority metric: symbolic execution dynamically evaluates the access range and density of each path over the input message fields, guiding the path exploration order and achieving effective distinction of message types and accurate clustering for field extraction. Message formats and field information can thus be extracted automatically from the protocol implementation program, improving the automation and applicability of protocol reverse engineering when traffic samples are insufficient.
Symbolic execution is a program analysis technique that systematically explores all possible behaviors of a program by assigning symbolic values rather than concrete values to program inputs and deriving program states from symbolic expressions while tracing execution paths. During symbolic execution, Spenny monitors field coverage statistics in real time to measure the degree to which each field of the input message is parsed, accessed, and operated on by the program. A path-field coverage matrix is constructed as the feature space for clustering, and similarity coefficients such as Euclidean distance, cosine similarity, and Jaccard are used as path-similarity indicators to aggregate paths with similar coverage characteristics into the same message type cluster, completing path-level message type clustering and mapping program behavior to protocol semantics. This approach breaks away from clustering driven by traffic similarity and from dependence on traffic samples, making it better suited to private-protocol reverse engineering, with stronger semantic expressiveness and a higher level of automation.
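The clustering stage of this pipeline can be sketched independently of the symbolic-execution engine. Below, each explored path is represented by an invented binary coverage vector (1 if the path touches a given message field), and paths are greedily grouped by Jaccard similarity; the greedy grouping and the 0.6 threshold are simplifications for illustration, not Spenny's exact procedure.

```python
def jaccard(u, v):
    inter = sum(1 for a, b in zip(u, v) if a and b)
    union = sum(1 for a, b in zip(u, v) if a or b)
    return inter / union if union else 1.0

def cluster_paths(coverage, threshold=0.6):
    """Greedily group execution paths whose binary field-coverage
    vectors are similar; each cluster approximates one message type."""
    clusters = []
    for vec in coverage:
        for c in clusters:
            if jaccard(c[0], vec) >= threshold:
                c.append(vec)
                break
        else:
            clusters.append([vec])
    return clusters

# Rows: explored paths; columns: 1 if the path touches that message field
cov = [[1, 1, 0, 0],   # path A: header fields only
       [1, 1, 1, 0],   # path B: header plus a payload field
       [0, 0, 1, 1],   # path C: payload handling
       [0, 0, 1, 1]]   # path D: payload handling
print(len(cluster_paths(cov)))  # → 2
```
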
4.3. Protocol field segmentation
ICP field segmentation refers to dividing the byte stream of a raw message into field units with structural boundaries and potential semantics, thereby restoring the format structure of the protocol message. This process is the key link between message clustering and field semantic recognition, and directly determines the accuracy and effectiveness of subsequent field type determination, control logic extraction, and state machine modeling. Since ICPs widely use private, customized formats and lack standardized documentation, their message structures usually exhibit complex features such as variable field lengths, non-fixed positions, and nested structures. As a result, manual segmentation based on byte-by-byte comparison is not only inefficient but also difficult to apply to the automatic parsing of large-scale, multi-source heterogeneous protocol data. Studying efficient and scalable automatic field segmentation methods has therefore become an important research direction in ICP reverse engineering.
Combining the field segmentation basis and method characteristics of existing research work, we summarize the field segmentation methods in ICPs into three categories: field segmentation methods based on data patterns, field segmentation methods based on program behavior, and field segmentation methods based on statistical learning and clustering reasoning. Among them, the first type of method mainly relies on the structural features in network traffic, and identifies field boundaries through strategies such as frequency analysis, sequence alignment, or control field guidance; the second type of method uses program analysis techniques such as dynamic taint analysis and symbolic execution to track the processing path of the field from the protocol implementation logic, and realize the joint analysis of structure and semantics; the third type of method models the field division task as a classification problem, and uses the voting mechanism driven by training data, probability modeling, or deep learning to automatically infer the field division position. These three types of methods have their own advantages and disadvantages when facing different characteristics of data acquisition conditions, protocol complexity, and reverse goals, and are important technical supports for building ICP field-level restoration capabilities.
4.3.1. Field segmentation based on byte pattern
The field division method based on data pattern refers to the method of directly inferring field boundaries from the original byte stream without accessing the protocol implementation program by extracting the byte distribution characteristics, structural consistency and alignment rules of the message in the network traffic, and using byte-level analysis methods such as sequence alignment. Beddoe was the first to propose the use of sequence alignment for protocol reversal. In his Protocol Informatics system, he pioneered the application of the multiple sequence alignment (MSA) algorithm in the field of bioinformatics to the analysis of network protocol message formats, breaking through the limitations of traditional reliance on artificial inspiration or manual labeling, and establishing “structural consistency + local variability” as an important data-driven standard for field division [63]. It also promoted the further development of a number of studies in the same field, provided a formalized and universal analysis framework for field extraction in undocumented protocols, and greatly improved the automation level of protocol reversal.
This type of method assumes that protocol messages of the same type are structurally similar, so fields can be identified automatically using features such as byte frequency, entropy, repetitive patterns, or template structure. Common technical means include token frequency statistics over sliding windows, information entropy analysis at each byte position, and multiple sequence alignment between messages. The approach suits ICP scenarios where protocol fields are relatively stable and positions relatively fixed, and is particularly effective for vendor-proprietary protocols that are unpublished and regular in structure but lack execution-context support. It extracts boundary information from the byte-level structure and can segment fields without executable program code or prior format documents, making it one of the field segmentation strategies most commonly used in current ICP reversal, with strong engineering applicability. For example, Ouyang Liu et al. observed that a major limitation of protocol reversal frameworks based on explicit textual identifiers, such as Discoverer, is that pure binary protocols without text subsequences cannot be inferred [64]. They therefore proposed IPRFW, an automatic field-level protocol reversal tool. The framework adopts a bottom-up field merging strategy that periodically merges adjacent small fields into larger ones based on field value information. Its main novelty is the Entry Distance Cluster (EDC) algorithm, which measures the distance between two adjacent candidate fields and cyclically merges adjacent fields to infer field boundaries. The underlying idea is that if the entries of two adjacent fields are close, the two fields are more likely to constitute a single complete field.
Therefore, for every two adjacent potential fields, this method calculates the degree of difference in their distribution through indicators such as KL divergence and L1 distance and records it as its boundary score. Then, starting from the smallest unit, the adjacent field units with the smallest score are continuously merged, that is, they are considered to belong to the same logical field until the merged distance exceeds the threshold or reaches the maximum field width limit.
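As an illustration, the distribution-difference criterion behind this merging strategy can be sketched as follows. This is a minimal, simplified variant rather than the authors' EDC implementation: it compares the sorted value-frequency profile of adjacent byte positions and starts a new field wherever the profile changes sharply; the threshold value is an assumption.

```python
from collections import Counter

def freq_profile(messages, pos):
    """Sorted relative-frequency profile of the byte values at one position."""
    counts = Counter(m[pos] for m in messages)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)

def l1(p, q):
    """L1 distance between two frequency profiles (padded to equal length)."""
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return sum(abs(a - b) for a, b in zip(p, q))

def segment(messages, threshold=0.5):
    """Greedy segmentation: adjacent positions with similar value
    distributions are kept in one field; a sharp distribution change
    (boundary score above the threshold) starts a new field."""
    n = min(len(m) for m in messages)
    profiles = [freq_profile(messages, i) for i in range(n)]
    fields, start = [], 0
    for i in range(1, n):
        if l1(profiles[i - 1], profiles[i]) > threshold:
            fields.append((start, i))
            start = i
    fields.append((start, n))
    return fields
```

On messages consisting of a constant two-byte header followed by variable payload bytes, `segment` keeps the constant positions together and separates them from the variable region, mirroring the merge-until-distribution-diverges behavior described above.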
As a field segmentation method based on byte-distribution patterns, this approach is computationally lightweight and structurally adaptable. It handles pure binary data well, does not depend on semantic tags or program context, and can be embedded in other reverse engineering frameworks as a field preprocessing module. It suits communication protocols with short fields, dense control fields, and compact structures (such as Modbus and IEC 104). However, its reliance on pronounced differences in value distributions between fields means that IPRFW's recognition accuracy may drop for protocols whose fields have similar value distributions or that contain variable-length fields and nested structures. Moreover, the method does not address semantics or state machine recovery and struggles with more complex protocol structures. IPRFW is therefore best used as a preliminary structural recovery tool; in scenarios with complex payload structures, the correlation between fields and payloads must be explored further to improve the accuracy and completeness of protocol structure recovery.
In this regard, Yuhuan Liu et al. noticed that existing research on protocol format inference focuses on the identification and modeling of message header structures, but lacks in-depth exploration of the internal structure of the message payload part, and usually divides the entire message into a single variable-length field for processing [65]. This processing method ignores the potential structured patterns in the payload area. In actual ICPs, such as messages with function code 0x10 (write multiple registers) in the Modbus protocol, the payload part often contains multiple register write units with consistent structures and independent of each other, and each unit constitutes a separable sub-message. In addition, there may be multiple types of variants of sub-messages within the same function code, and the payloads between different function codes may also share sub-message forms with similar structures. The above phenomenon shows that the protocol payload area has more fine-grained structural information, and mining its internal sub-message structure is of great significance for improving field extraction accuracy and protocol format versatility.
Based on this view, the authors proposed a new sub-message extraction algorithm that uses template iteration as an intermediate step to infer the message field structure. First, the original ICP messages are preprocessed and aligned at the byte level, and each message is represented as a byte sequence in a unified format. A sliding window then traverses the message content and computes a field variability index (FVI) at each position to identify structural mutation points as candidate sub-message boundaries. By counting the structural frequency and change trend at these candidate boundaries across multiple messages, the system automatically identifies the potential start and end positions of sub-messages in the payload and extracts sub-message fragments that are structurally stable and occur frequently. The message is then segmented according to the composition of sub-messages in the payload, and a format template is extracted, yielding sub-message structure information such as the number of fields, field positions, and length characteristics. Finally, the system matches the extracted sub-message template against each message to complete the recovery path from message to sub-structure to field structure, significantly improving field recognition accuracy and the expressive power of the protocol format under complex payload structures.
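A sketch of the repeating-sub-message idea is shown below. The paper's FVI computation is not reproduced here; as a stand-in we test whether per-offset byte distributions repeat with a candidate period, and the tolerance and maximum unit size are assumptions.

```python
from collections import Counter

def byte_dist(payloads, pos):
    """Empirical distribution of byte values at one payload offset."""
    counts = Counter(p[pos] for p in payloads)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def l1(p, q):
    """L1 distance between two discrete byte-value distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def submessage_size(payloads, max_unit=8, tol=0.1):
    """Return the smallest candidate unit size k such that the per-offset
    byte distributions repeat with period k over the whole payload,
    i.e. the payload looks like a sequence of structurally identical
    k-byte sub-messages (e.g. register-write units)."""
    n = min(len(p) for p in payloads)
    dists = [byte_dist(payloads, i) for i in range(n)]
    for k in range(1, max_unit + 1):
        if n % k:
            continue  # unit size must tile the payload
        score = sum(l1(dists[i], dists[i % k]) for i in range(k, n))
        if score <= tol:
            return k
    return None
```

For payloads built from three 4-byte units of the form (constant, variable, constant, variable), the detected period is 4, matching the register-unit structure of a Modbus write-multiple-registers payload.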
4.3.2. Field segmentation based on program analysis
The program-analysis-based field partitioning method recovers protocol field boundaries by observing how the protocol implementation accesses and parses messages, exploiting dynamic information from program execution. Such methods typically employ dynamic taint analysis or symbolic execution, marking input bytes and tracking their propagation at run time to determine the position, length, and function of each field in the protocol processing logic. By analyzing how the byte stream participates in variable assignments, branch decisions, array indexing, and similar operations, the semantic role of each field can be explored and its structural boundaries inferred. These methods capture the structural information and functional semantics of protocol fields simultaneously and suit ICP reverse engineering scenarios with complex control flows and state dependencies. Representative work such as the Prospex system simulates the execution behavior of protocol programs and constructs field processing path maps, achieving joint extraction of field boundaries and field semantics [66]. Although this approach performs well in accuracy and depth of understanding, it depends on the availability and executability of the protocol implementation, and incurs nontrivial computational and deployment costs when handling large-scale, diversified protocols.
Compared with other protocol analysis methods, a major advantage of program-level analysis is that it does not rely on the semantic information carried by network traffic, so it is still effective in situations where the protocol trace is unavailable or incomplete. Juan Caballero and others noticed this feature as early as the early days of automated protocol reverse engineering research [67]. In response to the dependence of Discoverer and other methods on the availability of network messages at the time, the author proposed the Polyglot framework, which is based on dynamic taint marking technology and realizes automatic restoration of message format structure without communication data by analyzing the protocol implementation program itself. The technical process of the Polyglot framework includes four key stages: first, each byte of network input is set as an independent taint source, and the propagation path of these bytes in the protocol implementation program is recorded using dynamic taint tracking technology; then, a context vector is constructed for each byte, encoding information such as the function call stack, operation semantics, and usage location during its processing; bytes are clustered based on context similarity, and bytes with consistent processing behavior are merged into the same field, solving the problem of fields being split or propagated across functions; finally, the order relationship and structural pattern of fields are extracted by generating a field dependency graph, and a protocol format template containing field boundaries, types, and structural constraints is output. The framework is particularly suitable for complex scenarios such as multi-protocol coexistence, field reuse, and structure nesting.
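The clustering step, merging bytes whose processing contexts match, can be illustrated schematically. The sketch below assumes the taint tracker has already produced, for each input offset, the set of execution contexts (e.g., call-stack identifiers) that touched it; real systems obtain this via binary instrumentation, which is not reproduced here, and the context labels are hypothetical.

```python
def fields_from_contexts(contexts):
    """Group consecutive input offsets that were processed under the same
    execution context into one field, in the spirit of context-vector
    clustering. `contexts` maps byte offset -> frozenset of context ids
    observed by the (simulated) taint tracker."""
    offsets = sorted(contexts)
    fields, start = [], offsets[0]
    for prev, cur in zip(offsets, offsets[1:]):
        if contexts[cur] != contexts[prev]:
            # processing behavior changed: close the current field
            fields.append((start, prev + 1))
            start = cur
    fields.append((start, offsets[-1] + 1))
    return fields
```

Given a trace where offsets 0-1 are touched only by a header parser, 2-3 only by a length reader, and 4-5 only by a payload copy, the function recovers the three corresponding fields.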
The Polyglot framework can accurately identify the protocol field structure through program-level dynamic analysis. The underlying insight is that the way a protocol implementation processes incoming messages reveals a great deal about the protocol format, so the format can be reversed by analyzing byte traces through the protocol implementation. The method suits complex scenarios such as multi-protocol coexistence, field reuse, and structure nesting, and relies on neither network traffic nor prior format information. Polyglot's pioneering work pushed protocol reversal from traffic analysis toward program semantic modeling, providing stronger structural and semantic support for document-free protocol format recovery and an important foundation for subsequent applications such as semantic reasoning and attack detection. However, dynamic taint analysis itself carries a large computational overhead and is difficult to scale to very large or black-box devices. Moreover, although instruction-level semantic information is indeed useful for extracting message fields, it reveals only the “flat” structure of the protocol format, that is, a linearly arranged, non-nested message format with independent fields.
In order to reverse engineer network protocols more accurately and thoroughly, it is equally important to expose the complex hierarchical structure of protocol messages and reveal cross-field relationships. In this regard, Zhiqiang Lin et al. proposed AutoFormat, a protocol format reverse engineering method based on program dynamic analysis [68]. By combining dynamic taint analysis with context-aware mechanisms, it can automatically restore the message field structure and semantic information from the execution process of the protocol implementation program without any protocol documents or network traffic. This method takes the actual use path of the program for the input data as the core, records the propagation trajectory, call stack, control dependency, and access semantics of each byte during the execution process, aggregates bytes with similar processing patterns into fields, and extracts detailed field boundaries, types, and semantic roles of fields in the protocol processing flow, such as command fields, length fields, or data segments, and then generates protocol format templates. Compared with traditional traffic-based passive analysis methods, this solution performs better in identifying field semantics and nested structures, and is especially suitable for scenarios with no traffic, encrypted communications, or complex nested structures. It is one of the important foundational works in the program-level protocol reverse engineering direction.
The way a protocol implementation identifies and processes protocol messages provides a great deal of information about the protocol format. Program-analysis-based works perform well on text-based protocols (e.g., HTTP). ICS protocols, however, are mostly binary and usually contain no delimiters or keywords. For analyzing binary protocol programs, Rongkuan MA et al. argue that the aforementioned methods are usually too coarse-grained and propose a finer-grained Industrial Control System Protocol Reverse Engineering Framework (ICSPRF) [69]. This method uses dynamic taint analysis to identify the access behavior and context information of the protocol implementation when processing input messages, thereby inferring the field structure and boundaries of the protocol message. Specifically, each byte in the message is first set as an independent taint source, and the propagation path and processing of tainted bytes in memory are recorded by running the protocol implementation. The machine-instruction logs that access these taints are then collected, byte sets with similar access patterns are extracted, field boundaries and structure templates are constructed from the instruction type, stack information, and access order, and finally the protocol format structure is output.
The ICSPRF framework can accurately capture the access behavior of input fields in the protocol implementation program through dynamic taint analysis, and identify the field boundaries and structural relationships of protocol messages without network traffic or protocol documents. It is suitable for encrypted communication or private protocol scenarios, and has a high recognition accuracy when processing complex protocol features such as variable-length fields, nested structures, and control fields. At the same time, its field clustering strategy combines access offsets, instruction types, and call stack contexts, and can obtain the field meaning while restoring the field structure. This method promotes the transformation of ICP reverse engineering from passive traffic analysis to active program semantic modeling, and provides a solid foundation for applications such as automatic structure recovery, fuzz testing, and intrusion detection of undocumented ICPs. It is a cutting-edge achievement in field partitioning research for program analysis.
4.3.3. Field segmentation based on model learning
The model-learning-based field segmentation method models field boundary inference as a pattern recognition or classification task, predicting segmentation positions through supervised or unsupervised learning algorithms with the help of statistical features and structural regularities observed in large numbers of samples. This type of method does not rely on protocol source code; instead, it extracts features such as byte frequency distribution, degree of variation, positional entropy, and relative offset stability from network messages, and infers the field structure through cluster analysis, topic modeling, or graph modeling. In more complex scenarios, deep learning techniques such as RNNs and Transformer structures can be introduced to model message sequences, giving more robust handling of variable-length fields or nested field relationships. With a training mechanism, the model can accurately identify field types and boundaries, with strong adaptability and good scalability. Such methods show significant advantages when protocol programs are unavailable and field structures are diverse, but their performance depends heavily on sample quality and model parameter design, and errors may arise for extremely rare fields or highly dynamic structures.
There are two main challenges for reverse engineering of ICPs for traditional general reverse engineering methods. First, the code of ICS applications is difficult to obtain, which makes the program analysis-based methods have natural defects. Second, ICPs are very different from network protocols. For example, most ICPs have a “flat” structure, that is, a header with global information of the message, followed by an optional payload, and most ICPs have no delimiters and are short in length, which reduces the performance of statistical-based methods [70].
In response to these difficulties, Xiaowei Wang et al. proposed IPART, an automatic reverse engineering tool for ICPs based on global voting experts [71]. The core idea is to build a “global voting expert” mechanism by integrating multiple heuristic rules to efficiently identify protocol field boundaries and structural information from the original industrial network traffic. This method first preprocesses and preliminarily clusters the captured protocol messages, and then introduces multiple field expert perspectives (such as byte distribution frequency, value stability, field recurrence rate, position offset law, etc.) to independently judge the boundaries of each field, and then integrates the judgment results of different experts through a global voting algorithm to improve the consistency and robustness of field division. IPART not only supports automatic parsing of standard protocols such as Modbus and S7Comm, but also shows good adaptability and scalability when facing private protocols with complex structures and frequent variable-length fields. Experimental results show that its field recognition accuracy and stability are better than traditional analysis methods based on single features, with a high level of automation and engineering application potential. It is one of the representative multi-feature fusion methods in the field restoration link of ICPs.
The advantage of using a classification model is that it improves the accuracy of field boundary identification by integrating the judgment results from multiple feature dimensions or heuristic rules. Different experts can make independent judgments based on features such as byte frequency, value stability, position offset rules, and degree of variation. The voting mechanism can effectively integrate these judgments, thereby reducing the interference caused by misjudgment of a single feature and improving the ability to make consistent judgments on boundary positions. In particular, it exhibits stronger fault tolerance and stability when facing ICPs with complex structures, longer fields, or more message noise. At the same time, the algorithm does not rely on deep learning models or a large amount of annotated data, has higher interpretability and lower computational overhead, is easy to deploy in actual industrial environments, and supports flexible expansion of the expert set as new features or rules are introduced. It is a lightweight, efficient, and scalable structure recognition method suitable for the field division link in ICP reverse tasks.
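A minimal sketch of this voting idea follows. The three experts and their thresholds are illustrative assumptions, not IPART's actual expert set: each expert independently proposes boundary positions from a different statistical feature, and a boundary is kept only when a quorum agrees.

```python
from collections import Counter
from math import log2

def entropy(col):
    """Shannon entropy of the byte values observed at one position."""
    counts = Counter(col)
    t = len(col)
    return -sum(c / t * log2(c / t) for c in counts.values())

def vote_boundaries(messages, quorum=2):
    """Each 'expert' proposes boundaries between adjacent byte offsets;
    a boundary survives when at least `quorum` experts agree."""
    n = min(len(m) for m in messages)
    cols = [[m[i] for m in messages] for i in range(n)]
    ent = [entropy(c) for c in cols]
    distinct = [len(set(c)) for c in cols]

    experts = [
        # expert 1: sharp entropy change between adjacent positions
        {i for i in range(1, n) if abs(ent[i] - ent[i - 1]) > 1.0},
        # expert 2: constant position adjacent to a varying one
        {i for i in range(1, n) if (distinct[i] == 1) != (distinct[i - 1] == 1)},
        # expert 3: large jump in the number of distinct values
        {i for i in range(1, n)
         if max(distinct[i], distinct[i - 1])
            >= 4 * max(1, min(distinct[i], distinct[i - 1]))},
    ]
    votes = Counter(b for e in experts for b in e)
    return sorted(b for b, v in votes.items() if v >= quorum)
```

On messages with a constant two-byte prefix followed by variable bytes, all three experts agree on the single boundary between the constant and variable regions, so the vote is unanimous; a feature that misfires in isolation would be outvoted.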
4.4. Key field identification
Key field identification in ICPs refers to identifying, from message data, the control field units that decisively influence protocol behavior, such as control fields, function codes, and session identifiers. This process directly affects the recovery quality of protocol interaction logic, the accuracy of state transition modeling, and the effectiveness of fuzz test scenario generation. One advantage of identifying key fields is that it simplifies protocol format parsing. For example, grouping and comparing messages by function code or session ID helps focus the analysis on the field regions most sensitive to structural and semantic changes, improving the interpretability of field semantic inference and protocol behavior modeling. In addition, key fields provide core input for subsequent applications such as state machine construction, fuzz test generation, and anomaly detection, effectively supporting vulnerability mining and security assessment of control logic, and they also help improve performance and reduce computational cost in scenarios such as protocol adaptation and protocol clustering. Key field identification is thus a key step toward automated reverse engineering and behavior recovery of ICPs.
Different types of ICPs have different field structures, and identifying which fields are key fields that have a decisive influence on communication behavior is the basic link for realizing protocol structure restoration and semantic modeling. Ignacio Bermudez et al. systematically summarized the field categories with high behavior indicativeness in the FieldHunter system, and pointed out that the following six types of fields play a core control role in protocol interaction [72]:
-
Message Type field (MSG-Type), such as the flag bit in the DNS protocol or the GET/POST operation code in the HTTP protocol, which is used to indicate the functional semantics of the message;
-
Message Length field (MSG-Len), which is commonly found in the TCP stack and is used to define the boundary range of the application layer message in the byte stream;
-
Host Identifier (Host-ID), such as the client ID or server ID, is used to maintain the communication context in a multi-host scenario;
-
Session Identifier (Session-ID), such as mechanisms such as cookies, is used to associate multiple logical messages to form a complete session flow;
-
Transaction Identifier (Trans-ID), such as a sequence number or confirmation number, is responsible for request-response pairing and transaction consistency maintenance;
-
Accumulator fields, such as timestamps or counters, are used to reflect the evolution of system states or control traffic sequences.
The above field types are usually regarded as key control fields in protocol reversal and security modeling because they have a direct controlling effect on the protocol behavior path, state transition logic, and interaction integrity.
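As an example of how one of these field types can be detected in practice, a MSG-Len candidate can be flagged by correlating decoded integers with observed message lengths, a simplified sketch in the spirit of FieldHunter's statistical tests. The field width, big-endian decoding, and correlation threshold are assumptions.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def find_length_fields(messages, width=2, min_corr=0.99):
    """Flag candidate MSG-Len fields: offsets where a big-endian integer
    correlates almost perfectly with the total message length."""
    n = min(len(m) for m in messages)
    lengths = [len(m) for m in messages]
    hits = []
    for off in range(n - width + 1):
        vals = [int.from_bytes(m[off:off + width], 'big') for m in messages]
        if pearson(vals, lengths) >= min_corr:
            hits.append(off)
    return hits
```

Note that overlapping windows at adjacent offsets can also correlate strongly and co-fire; a practical implementation would post-filter such duplicates and cross-check candidates against other field-type tests.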
In the development of automatic protocol reversal tools centered on key field identification, Discoverer, proposed by Weidong Cui et al., is an early milestone [73]. Observing that manual reversal is time-consuming and non-standardized, the authors proposed clustering messages by the distribution of text tokens and binary tokens in the input data, placing messages with the same token pattern and the same communication direction into one cluster. Field statistics are then computed over all messages in the cluster, and the attributes of each field are analyzed, such as how frequently its value changes, the length regularity of subsequent fields, and whether it acts as a field index. Based on these heuristic indicators, Discoverer identifies the format distinguisher (FD) field, i.e., the key field, in the cluster. Different values of this field correspond to different message structure patterns, so the messages in the cluster can be divided by FD value into multiple sub-clusters with more consistent structures. Each sub-cluster is then treated as a new cluster, and the above steps are repeated recursively until a termination condition is met.
Discoverer is one of the earliest systematic tools in automatic protocol reverse engineering. With key-field division at its core, it can automatically infer the format structure of protocol messages from network traffic without prior knowledge, offering high generality and requiring no label support. As an early study, however, Discoverer has limited recognition capabilities, in particular an inability to handle pure binary protocols, and its computational overhead is high, leaving room for improvement. Overall, Discoverer's pioneering work largely established the core process standards for field division and format recovery and profoundly influenced subsequent methods such as Netzob, ReFormat, and Spenny [57, 62, 74]. Although its accuracy and semantic capabilities are limited, its structure-driven clustering ideas remain widely used.
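The recursive splitting step at the heart of this process can be sketched as follows; the scoring heuristic is our illustrative assumption, not Discoverer's exact criteria.

```python
from collections import defaultdict

def split_by_fd(cluster, offset):
    """Partition a message cluster by the value of a candidate format-
    distinguisher byte, yielding sub-clusters with more uniform structure."""
    groups = defaultdict(list)
    for msg in cluster:
        groups[msg[offset]].append(msg)
    return dict(groups)

def fd_score(cluster, offset):
    """Heuristic FD score: a good distinguisher takes few distinct values,
    and messages sharing a value also share a structure (here approximated
    by message length)."""
    groups = split_by_fd(cluster, offset)
    if not 1 < len(groups) <= 8:
        return 0.0
    uniform = sum(1 for g in groups.values() if len({len(m) for m in g}) == 1)
    return uniform / len(groups)
```

On a Modbus-like cluster where byte 0 is a function code that determines the message length, offset 0 scores highest and splitting on it yields two structurally homogeneous sub-clusters, which would then be processed recursively.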
Another key problem with Discoverer is that it identifies fields based on the similarity of sequence structures. This method, like other alignment-based or token-based key field identification methods, relies on the assumption that messages with similar values or patterns are of the same type. In many cases, however, this assumption does not hold. When a client or server receives a message, it often determines the message type only by keywords, and the data sequences of messages of the same type may be completely different. Therefore, directly inferring the fields that represent keywords can yield better clustering results. Based on this observation, Yapeng Ye et al. proposed NetPlier, a protocol reverse engineering method driven by control fields [75]. This method identifies control fields by constructing a probabilistic graphical model. The core idea is to exploit the dominant effect of control fields on message structure patterns, analyzing a large number of network messages under unsupervised conditions to infer the causal relationship between fields and structures.
First, NetPlier parses the original message into multiple field candidates, extracts local byte patterns through sliding windows and generates field instances with unfixed positions and lengths, and then builds a joint probability model to represent the statistical dependency between fields and structural features (such as message length, field occurrence frequency, and field combination style); on this basis, NetPlier iteratively learns model parameters through maximum likelihood estimation and Bayesian inference techniques, screens out those control fields that have a significant impact on other field boundaries, occurrence probabilities, or value distributions, and sorts them based on confidence based on indicators such as field entropy changes and structural prediction consistency, thereby identifying the most likely main control fields such as message type fields, function codes, and transaction identifiers. This method can automatically discover the structure-driven relationship between fields when facing an unknown protocol, effectively supporting subsequent protocol format modeling and semantic recovery tasks.
NetPlier not only considers the similarity of statistical features of message formats, but also models the statistical dependency between the message fields and the structural changes, and accurately identifies the control fields according to the degree of dominance of each field on the message, showing strong protocol independence and robustness. It still shows good results in ICPs with variable field lengths and non-fixed field positions. Compared with the heuristic rule-based method, its inference mechanism is more interpretable and mathematically rigorous. The only problem is that the computational complexity of the graph model reasoning process is high, and efforts should be made to optimize the method overhead in the future.
Inspired by NetPlier, Zhen Qin et al. proposed the REInPro framework, a key-field-driven ICP reverse engineering method whose core idea is broadly similar to NetPlier's [76]. First, a set of candidate fields is extracted from a large number of original messages, and the statistical characteristics of each field across messages, such as frequency, position distribution, value variability, and entropy, are computed to construct similarity and intensity scores between fields. Two core indicators, Field-Cluster Consistency and Field-Structure Dependency, are then used to identify the most likely control field candidates. These fields usually have strong message-classification power and structurally guide the appearance and arrangement of other fields. On this basis, messages are grouped by control field, and field boundaries are segmented and semantics extracted within each group, converting the global field division task into subtasks over locally similar structures and reducing the modeling difficulty caused by field length variation and unstable field order. Experiments on multiple ICPs such as Modbus and S7Comm show that, compared with uniformly modeling the entire message, the control-field-driven strategy performs better in field recognition accuracy, semantic consistency, and structural restoration accuracy. It is especially suitable for ICP scenarios with clear functional differentiation and structural variation, offering a reverse engineering path that combines generality with semantic sensitivity.
But unlike NetPlier, which does not involve direct semantic correspondence between fields and physical world control variables, REInPro analyzes the control logic of the PLC program, extracts the mapping between Control Fields and field behaviors, identifies the real physical semantics carried by the protocol fields (such as motor start signals, temperature threshold settings, etc.), and establishes a cross-level (program-protocol-physical) semantic bridge, which significantly enhances the integrity and credibility of the protocol field interpretation.
4.5. Field semantic recognition
In the reverse engineering of ICPs, field semantic recognition refers to the process of inferring, after field boundary division is complete, the specific role and function of each field in the protocol semantics, such as command code, address, data payload, or status code, so as to achieve a deep understanding of protocol behavior and interaction logic. This process is not only a key step in converting structural information into semantic knowledge, but also the core foundation for advanced analysis tasks such as state machine modeling, fuzz test generation, and protocol conformance verification. Because ICPs widely adopt proprietary designs, the absence of field naming and semantic descriptions, together with strong context dependence, often means that field semantics cannot be determined by simple rules. At the same time, protocol messages contain many redundant fields, padding areas, or repeated information, which further complicates semantic recognition. The study of automated field semantic derivation has therefore become a key path for advancing protocol semantic modeling from structure awareness to behavior awareness. Combining the existing literature and mainstream technical ideas, this paper divides field semantic recognition methods in ICP reverse engineering into two categories: semantic inference methods based on feature matching and semantic recognition methods based on machine learning.
4.5.1. Semantic recognition based on feature matching
The semantic inference method based on feature matching refers to building a correspondence between a field and the semantic entity it represents based on the aforementioned reverse work and combining external observable feature information. This type of method usually explores the control or status information carried by the field in the protocol interaction by comparing the change pattern of the field with the synchronization of the external system behavior. Typical available matching objects include system operation logs, variable dependency paths in control logic, field equipment status records, industrial side information data, and even the same type of protocol features with known structures. By building the temporal consistency, sequence similarity, causal coupling or change trend correlation between unknown fields and external behaviors, the role of the field in the actual control semantics can be effectively inferred, such as operation commands, target values, status feedback, etc. The core advantage of this type of method is that it has the ability to semantically bridge between protocol fields and physical world entities, and has stronger interpretability. However, in practical applications, this type of method has a strong dependence on the quality and availability of synchronized data sources, and the construction of matching strategies needs to consider challenges such as context alignment, data noise and ambiguity. It is usually implemented in combination with machine learning, time series analysis or causal inference and other technologies, representing an important direction for the evolution from structural recognition to system semantic modeling.
A direct idea is to modify target fields and observe how the system response changes. Qun Wan et al. proposed MSERA, a practical joint format and semantic reverse method for ICPs that automatically recovers protocol field structure from network traffic and identifies field semantic roles [77]. In the semantic recognition stage, the method uses field variability analysis and control-interaction characteristics to summarize how fields change across message types, and combines a heuristic rule-based reasoning system to classify field types into semantic labels such as function code, address, and data length. Expert knowledge is also introduced as an auxiliary constraint to improve accuracy. Experiments on several typical ICPs (such as Modbus and IEC 104) achieved semantic recognition accuracy above 85%, indicating that the method offers good practical value and a high level of automation for ICP reverse engineering.
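As a minimal sketch of variability-based inference (an illustration of the general idea, not MSERA's actual implementation), the snippet below counts the distinct values each field takes across a message corpus and assigns coarse heuristic labels. The field boundaries are assumed to come from a prior boundary-division step, and the labeling thresholds are arbitrary choices for demonstration.

```python
from collections import defaultdict

def field_variability(messages, boundaries):
    """Count the distinct values each field takes across a set of
    equal-format messages. `boundaries` is a list of (start, end)
    byte offsets from a prior field-division step."""
    values = defaultdict(set)
    for msg in messages:
        for (start, end) in boundaries:
            values[(start, end)].add(msg[start:end])
    return {field: len(vals) for field, vals in values.items()}

def label_fields(variability, n_messages):
    """Heuristic labels in the spirit of variability analysis: constant
    fields resemble magic numbers, low-cardinality fields resemble
    function codes, high-cardinality fields resemble data or counters.
    The cutoff below is an illustrative assumption."""
    labels = {}
    for field, cardinality in variability.items():
        if cardinality == 1:
            labels[field] = "constant (e.g. protocol ID)"
        elif cardinality <= max(2, n_messages // 10):
            labels[field] = "enumerated (e.g. function code)"
        else:
            labels[field] = "variable (e.g. data / counter)"
    return labels
```

On a toy corpus where the first two bytes are fixed, the third alternates between two codes, and the rest varies per message, the three fields receive the constant, enumerated, and variable labels respectively.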
A key limitation of such variation-analysis-based comparison is its dependence on a controllable interaction environment: in real ICSs, active mutation risks triggering anomalies or interfering with control processes. It is also difficult to cover the combined dependencies between fields; when fields are jointly encoded or depend on protocol state, single-point mutation rarely recovers the complete semantics. In addition, the efficiency and coverage of this approach are limited by the design of the mutation strategy and the granularity of response observation, which makes automation and generalization difficult in large-scale protocol analysis. It is therefore better suited as an auxiliary means of semantic identification than as a standalone solution.
To reduce the dependence on a controllable interactive environment, one feasible path is to exploit the structural features and semantic information of known protocols: build a feature comparison mechanism, perform field-level similarity analysis and semantic migration on unknown protocols, and thereby infer protocol semantics with little or no interaction. Along this line, Syed Ali Qasim et al. proposed PREE, a framework for ICP field identification based on heuristic rules [78]. Its core idea is to use the structural features and semantic knowledge of known protocols to support automated reverse analysis of unknown protocols through rule migration. The authors observe that ICPs often share design concepts and field functions, and this cross-protocol consistency provides a basis for heuristic knowledge transfer. PREE builds reusable rule templates and combines heuristic variables such as rolling windows, vertical windows, and frequency tables to identify field boundaries and candidate semantics, mapping them to typical field types such as function code, memory address, and data length. Through explicit rule combination and feature extraction, the framework achieves field structure division and preliminary semantic recognition without access to protocol documents or implementations, showing strong versatility, interpretability, and domain-knowledge transferability.
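The "vertical window" and "frequency table" heuristics mentioned above can be sketched as a column-wise value census: one frequency table per byte offset across all messages, with low-cardinality columns flagged as candidate enumerated fields such as function codes. This is a simplified illustration of the idea, not PREE's actual rule set, and the `max_values` cutoff is an assumed parameter.

```python
from collections import Counter

def vertical_frequency_tables(messages):
    """Column-wise ('vertical') value frequency tables: one Counter per
    byte offset, taken across all messages (truncated to the shortest)."""
    length = min(len(m) for m in messages)
    return [Counter(m[i] for m in messages) for i in range(length)]

def candidate_function_code_offsets(messages, max_values=8):
    """Offsets whose column takes only a small set of distinct values
    (but more than one) are candidate enumerated fields such as
    function codes -- a PREE-style heuristic."""
    tables = vertical_frequency_tables(messages)
    return [i for i, t in enumerate(tables) if 1 < len(t) <= max_values]
```

On a toy trace with varying transaction bytes, a constant protocol identifier, and a function code drawn from two values, only the function-code offset is flagged.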
Another, more reliable comparison object is the system operation log. Zeyu Yang et al. proposed ARES, an automatic reasoning framework for the physical semantics of programmable logic controller (PLC) variables based on control invariants, which aims to reveal the correspondence between variables in PLC programs and the physical world in ICSs [79]. The method maps variables to physical semantics by comparing the dependencies among the sensor and actuator records of an industrial process with the dependencies among variable entities in the PLC program. Specifically, the authors extract variable dependencies by analyzing the PLC program, and obtain the dependencies of physical entities through the probabilistic correlation between the running records of each sensor and controller. The two dependency graphs are then matched comprehensively, evaluating the correspondence between variables and physical quantities on the basis of temporal consistency and causal coupling, for example identifying whether a variable represents temperature, current, flow rate, or a switch state. The method requires no manual annotation, offers a new way to improve the semantic interpretability of fields and variables in ICPs, and has important application potential in ICP reversing and PLC security auditing.
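The correlation step of such log-based matching can be illustrated with a much simpler stand-in: map each PLC variable to the physical signal whose recorded trace it tracks most closely. ARES's actual matching additionally uses dependency-graph structure and causal coupling; the variable and signal names below are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def match_variables_to_signals(var_traces, signal_traces, threshold=0.9):
    """For each PLC variable trace, pick the physical signal with the
    highest absolute correlation; keep the match only if it clears a
    confidence threshold (an illustrative parameter)."""
    mapping = {}
    for var, trace in var_traces.items():
        best = max(signal_traces,
                   key=lambda s: abs(pearson(trace, signal_traces[s])))
        if abs(pearson(trace, signal_traces[best])) >= threshold:
            mapping[var] = best
    return mapping
```

A variable that rises linearly matches a temperature-like ramp, while one proportional to a valve record matches the valve signal.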
With the widespread deployment of IoT monitoring devices in ICSs, Zheqiu Hetu et al. pointed out that video surveillance data from industrial sites can serve as a non-invasive reference source for protocol reversal: the captured physical operations provide auxiliary verification and comparison evidence for the semantic reasoning of communication fields [80]. They proposed CASI, a context-aware semantic recognition framework that automatically infers the semantics of ICP fields by fusing industrial site video surveillance data with network communication data. The method first extracts the value-change sequence of each message field from the network communication to model the field's behavior trajectory, and uses the YOLOv5 and DeepSORT algorithms to detect and track industrial equipment in the surveillance video, thereby constructing the physical action sequence corresponding to the field behavior. A consistency scoring function then measures the temporal consistency and causal relationship between field value changes and physical actions. By analyzing the synchronization and correlation between field behavior and physical operation, CASI infers whether a field carries control semantics, and finally achieves automatic semantic recognition of key control fields such as instruction fields and status feedback fields. The framework requires no protocol documents, PLC program source code, or manually annotated labels, and offers high versatility and automation. Its practicality and accuracy have been verified in multiple real industrial scenarios, demonstrating an innovative path that integrates multimodal perception with ICP reverse engineering and advancing ICP research from structural recovery to semantic understanding.
Introducing video surveillance data from industrial sites as an auxiliary information source allows field semantic recognition without protocol documents or program source code, which is both innovative and practical. On the one hand, video data is interpretable and intuitive: it directly reflects the physical behavior of equipment, so the causal relationship between field value changes and physical actions can be perceived and measured, improving the accuracy and reliability of semantic inference. On the other hand, the method breaks through the limitation of traditional protocol reversal, which relies on network data alone, by broadening field semantic recognition through multimodal information fusion. Its effectiveness still depends, however, on video quality and scene constraints: occlusion of the camera's field of view, insufficient video resolution, or ambiguous action granularity may degrade the recognized semantic correspondences. The method also requires synchronizing and associating communication data with video frame sequences, which increases system complexity and deployment cost, and it is of limited use in non-visual scenes or environments where the data sources are not synchronized. Overall, introducing perception-layer data into the protocol reversal process opens an important new direction for the deep semantic restoration of ICPs.
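The consistency-scoring idea behind such multimodal matching can be reduced to a small sketch: score a field by the fraction of its value changes that are followed, within a short time window, by a detected physical action. This is a deliberate simplification of CASI's scoring function, and the window length is an assumed parameter.

```python
def consistency_score(field_events, action_events, window=1.0):
    """Fraction of field value changes followed by a physical action
    within `window` seconds. `field_events` and `action_events` are
    sorted lists of timestamps (seconds); a high score suggests the
    field carries control semantics."""
    if not field_events:
        return 0.0
    matched = sum(
        1 for t in field_events
        if any(0 <= a - t <= window for a in action_events)
    )
    return matched / len(field_events)
```

A field whose changes regularly precede equipment motion scores high; a field whose changes never line up with any action scores zero.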
4.5.2. Semantic recognition based on machine learning
Machine-learning-based semantic recognition, in its reinforcement-learning form, models field semantic reasoning as an interaction between an agent and an environment: through repeated action selection and feedback evaluation, the agent learns how field variation influences system behavior and thereby identifies the control or state semantics a field carries. In such methods, the fields of a protocol message are the operable variables, and the environment is the ICS or a simulation of it. The agent modifies field values, observes the system response, and iteratively builds a policy mapping from field actions to feedback. Typical feedback signals include changes in device action status, changes in response message content, and fluctuations in system operation indicators, all of which indicate whether a field variation triggered a semantically relevant control effect. Through a long-term reward mechanism, the agent can gradually converge to the correspondence between a field and its semantic role without explicit labeling. These methods adapt and explore well, and are especially suitable when field semantics are opaque, interaction dependencies are complex, or prior knowledge is lacking, providing a new technical path toward automated and generalizable ICP semantic recognition. In actual deployment, however, they still face complex environment modeling, high-dimensional state spaces, and low exploration efficiency, and need domain-knowledge constraints, sample-optimization strategies, and efficient learning algorithms to become practical.
Methods based on unsupervised learning and deep neural networks show strong automation and generalization in the semantic recognition of ICPs: without protocol documents or manual annotations, they can recognize field semantics and restore protocol formats efficiently by modeling the structural features of traffic data and classifying field functions. The PREIUD framework proposed by Bowei Ning et al. is representative of this line of work. It integrates cluster analysis, feature learning, and neural-network classification into an end-to-end pipeline for field semantic reasoning over private protocols, comprising traffic preprocessing and byte vectorization, field boundary detection, field semantic classification, and protocol format restoration [81]. First, raw protocol messages are converted into variable-length byte fragment sequences using V-gram encoding and vectorized into model input. Local structural features are then extracted by a convolutional neural network (CNN), and structurally similar fields are grouped without supervision using DBSCAN clustering to obtain preliminary field boundaries. In the semantic recognition stage, PREIUD constructs pseudo-labels from the clustering results and classifies field functions with a deep feedforward neural network (DNN) without manual labeling, recognizing the semantics of key fields such as Function Code, Address, and Value. Finally, the field structure and semantic results are visualized, their accuracy is evaluated, and the protocol format is restored. The method needs neither protocol source code nor label information and suits efficient field recognition in private-protocol environments. It generalizes well and is highly automated, but for protocols with high structural complexity or strong coupling between fields, it may still be limited by feature extraction and model generalization.
PREIUD jointly models field structure division and semantic recognition without labeled data, significantly raising the automation level of protocol reversal, and is particularly suitable for private-protocol scenarios where labels are scarce. It outperforms template- and heuristic-rule-based methods in generalization and adaptability, and its combination of V-gram encoding with deep models captures both local structures and global patterns, improving robustness to variable-length fields and structural variation. However, its clustering and pseudo-labeling mechanisms are sensitive to the quality of the initial feature extraction: unstable field distributions or complex nested structures in the input protocol can degrade both clustering quality and semantic classification accuracy. Model training also demands substantial computing resources, and rare fields may be misrecognized in extreme scenarios. While preserving the model's versatility, it is therefore still necessary to introduce domain priors or structural constraints tailored to the particularities of ICPs to improve recognition accuracy and interpretability.
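The cluster-then-pseudo-label stage of such pipelines can be illustrated with a pure-Python stand-in: here a normalized byte histogram replaces PREIUD's learned V-gram/CNN features, and a minimal DBSCAN groups structurally similar fragments, whose cluster IDs would then serve as pseudo-labels for a downstream classifier. The `eps` and `min_pts` values are tuned only for this toy example.

```python
def byte_histogram(fragment):
    """256-bin normalized byte histogram as a simple fragment vector
    (a crude stand-in for learned V-gram features)."""
    h = [0.0] * 256
    for b in fragment:
        h[b] += 1.0
    n = len(fragment) or 1
    return [c / n for c in h]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over numeric vectors (Euclidean metric).
    Returns one cluster label per point; -1 marks noise."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(points)
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1       # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:  # noise reached from a core point: border
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = [k for k in range(n) if dist(points[j], points[k]) <= eps]
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)
    return labels
```

On fragments drawn from two structural families (ASCII-like vs. zero-padded binary), the two families fall into two clusters, giving label-free groupings that a classifier could be trained against.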
4.6. Protocol state machine generation
In the reverse engineering of ICPs, protocol state machine generation constructs a finite state machine (FSM) that reflects the operation of the protocol by modeling the message interaction sequences and state transition logic in the communication data, thereby restoring the protocol's internal state transition mechanism and its triggering conditions. This helps analysts understand the protocol's control-flow structure and execution paths, and provides a dynamic behavioral model for security analysis tasks such as fuzz testing, intrusion detection, and security verification. State machine generation is one of the high-level goals of protocol reverse engineering: unlike field extraction and semantic recognition at the structural level, it focuses on the sequence relationships, response dependencies, and state dependencies between messages, and reveals the protocol's behavioral boundaries in actual communication by modeling how input messages affect system state. Because ICPs generally lack documentation and their state behavior mostly depends on internal device logic, traditional rule-based or manually constructed state diagrams are inefficient, subjective, and ill-suited to the dynamic modeling needs of complex protocols. Automated state machine generation therefore improves the completeness and scalability of protocol behavior modeling, which is especially important in applications such as stateful ICP fuzz testing, anomaly detection, and device simulation.
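The passive flavor of this task can be sketched very simply: take observed sessions of message types and record which types were seen to follow which, yielding a first-cut transition graph. Real FSM inference additionally merges equivalent states and models responses; the message-type names here are hypothetical.

```python
from collections import defaultdict

def infer_transition_model(sessions):
    """Build a naive state-transition graph from observed sessions.
    Each state is the last message type seen (starting from a synthetic
    'INIT' state), and its edge set records the message types observed
    to follow it."""
    transitions = defaultdict(set)
    for session in sessions:
        state = "INIT"
        for msg_type in session:
            transitions[state].add(msg_type)
            state = msg_type
    return dict(transitions)
```

From two sessions that authenticate before reading or writing, the model shows that only `CONNECT` follows the initial state and that `AUTH` branches to both operations, while never recording a read that loops back to authentication.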
For the state modeling of PLC programs, Herbert Prähofer et al. proposed an automatic state machine generation and visualization method based on reactive behavior [82]. The method uses symbolic execution to traverse the paths of a PLC program, identifies the dependencies between variables and control flow, and builds a finite state machine model of the program from its conditional branches and input triggers, systematically restoring how the PLC's behavior evolves under different inputs. The authors also introduced a structured state diagram visualization mechanism that presents complex state transition logic clearly, making it easier for developers and analysts to identify potentially abnormal logic and behavioral blind spots. The work applies to control program reversal in scenarios where source code is unavailable, provides a general tool path for state modeling and logic verification in ICSs, and is both representative and practically valuable at the application level of protocol state modeling.
The above method achieves high-precision behavior modeling at the PLC program level, but it relies on static analysis and symbolic execution of control logic and remains limited in anomaly detection and behavioral generalization for dynamic interaction processes. It also mainly serves scenarios where the program structure is obtainable, and cannot be generalized directly to network-layer scenarios where only communication traffic is available. To compensate, subsequent studies extended state modeling to the communication level, capturing system operating status dynamically by modeling the temporal patterns of protocol interaction behavior. Along this line, Kyriakos Stefanidis proposed an anomaly detection mechanism based on hidden Markov models (HMMs) that learns and restores protocol state transition characteristics from network data without any control program information, improving system adaptability while retaining real-time detection capability [83]. By modeling the state transition patterns of normal ICP communication flows, abnormal behavior can be identified efficiently. The author first collected network traffic from a real SCADA system and extracted observation sequences, represented as feature vectors, to train an HMM capturing the temporal regularities and state dependencies of protocol communication. The trained model computes the generation probability of an input sequence; sequences scoring significantly below the normal range are judged anomalous, indicating that the system may be under attack or operating abnormally. The experimental evaluation, conducted in a simulated ICS environment, covered typical attack scenarios such as DoS attacks, command injection, and parameter tampering. The results showed good detection accuracy and real-time performance, independent of the specific protocol format and applicable to multiple common ICPs. The author also analyzed how model complexity affects detection accuracy and latency, verified the deployability and versatility of HMMs in industrial scenarios, and emphasized that the dynamic adaptability and robustness of state-modeling-based methods, compared with static rules or signature detection, make them an important part of an active defense mechanism for SCADA systems.
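The scoring step of such HMM-based monitoring can be sketched with the standard forward algorithm: compute the log-likelihood of an observation sequence under the trained model, and flag sequences that fall far below the normal range. The two-state model and observation symbols below are invented for illustration; all emission probabilities are kept nonzero to avoid log-of-zero.

```python
import math

def _logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_prob(obs, start, trans, emit):
    """Log-probability of an observation sequence under a discrete HMM
    (forward algorithm). `start`, `trans`, and `emit` are nested dicts
    keyed by state name; every referenced probability must be > 0."""
    states = list(start)
    alpha = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    for o in obs[1:]:
        alpha = {
            s2: math.log(emit[s2][o]) + _logsumexp(
                [alpha[s1] + math.log(trans[s1][s2]) for s1 in states])
            for s2 in states
        }
    return _logsumexp(list(alpha.values()))
```

Under a model trained on polling-style traffic, a normal request/response sequence scores much higher than a burst of write commands, which a threshold on the log-likelihood would flag as anomalous.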
This approach learns state patterns automatically without relying on protocol details and exhibits strong adaptability and detection performance. To go further and reveal structural defects and state-control vulnerabilities in protocol implementations by actively exploring abnormal paths in the protocol state space, Joeri de Ruiter et al. proposed a protocol state derivation method based on active interaction [84]. Using the active learning algorithm in the LearnLib tool, they send carefully designed message sequences to the target system through black-box interaction and derive the protocol's state machine model from its responses. This process can reveal hidden redundant states, non-standard paths, and unauthorized state transitions, thereby exposing deviations between the actual protocol implementation and the expected specification. It works well in ICS scenarios: in particular, its ability to trigger unexpected state transition paths through variant interactions demonstrates the coverage and vulnerability-discovery power of active interaction in protocol state reversal.
4.7. Applications of ICP reverse engineering
In ICP reverse engineering research, the application stage uses the completed protocol format parsing, field semantic recognition, and state modeling results to serve network security protection and system operation assurance in actual industrial scenarios, typically in key directions such as protocol consistency detection, protocol fuzz testing, and protocol reverse protection. This stage not only verifies the practicality of the reverse engineering results, but also feeds requirements for interpretability and refinement back into the earlier modeling work. Because ICSs mostly run in high-reliability, hard real-time environments and face attacks that are increasingly complex and covert, traditional protection based on feature rules or signatures struggles with the hidden logical defects and undefined behaviors in protocol implementations. Applications built on protocol reversal can instead restore protocol behavior specifications without source code or documents, enabling accurate identification of abnormal communication and active mining of unknown threats. Protocol reversal can also support security testing platforms: fuzz testing systematically verifies protocol implementations, discovers vulnerabilities such as erroneous state transitions and field parsing anomalies in time, and improves the overall robustness and defensive capability of ICSs. As the methodology of protocol reverse analysis matures, its applications are expanding from security assessment to dimensions such as protocol consistency monitoring, attack tracing, and dynamic access control, becoming an irreplaceable technical support in the industrial network security system.
4.7.1. Protocol consistency detection
Protocol consistency detection statically verifies the protocol format, field meanings, and state logic obtained by reverse engineering, confirming that the parsing results are structurally consistent with real protocol behavior. For example, the parsed field structure is compared with messages actually captured on the network to check whether field boundaries match, whether field values are reproducible, and whether state transitions are logical. This process helps discover boundary errors, field offset inaccuracies, or type misjudgments introduced during reversal.
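Such checks can be sketched as a small validator that cross-checks an inferred format against captured messages: fields must tile each message without gaps or overlap, and a declared length field must match the actual byte count. The field names, the fixed-length assumption, and the big-endian length convention are illustrative choices, not a general specification format.

```python
def check_format_consistency(spec, messages):
    """Cross-check an inferred format against captured messages.
    `spec` is a list of (name, start, end) field tuples; returns a list
    of violation descriptions (empty means consistent)."""
    violations = []
    offsets = sorted((s, e) for _, s, e in spec)
    for (_, e1), (s2, _) in zip(offsets, offsets[1:]):
        if e1 != s2:
            violations.append(f"gap/overlap between offsets {e1} and {s2}")
    for i, msg in enumerate(messages):
        if len(msg) != offsets[-1][1]:
            # variable-length tails would need richer handling
            violations.append(
                f"message {i}: length {len(msg)} != spec end {offsets[-1][1]}")
            continue
        for name, s, e in spec:
            if name == "length":
                declared = int.from_bytes(msg[s:e], "big")
                if declared != len(msg) - e:
                    violations.append(
                        f"message {i}: length field {declared} "
                        f"!= actual {len(msg) - e}")
    return violations
```

A message whose length field matches its payload passes; one whose declared length disagrees with the actual payload size yields a violation, flagging either a bad capture or a wrong field hypothesis.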
Conventional cybersecurity research focuses on memory attacks against control logic programs, application protection passwords, configurations, and firmware [85]; when migrated to the industrial control field, it relies heavily on ICS network traffic or data logs and struggles to provide runtime detection. Addressing this, Yangyang Geng et al. proposed PLC-READER, a PLC memory integrity detection framework based on reverse engineering that can detect memory attacks on PLCs in ICSs [46]. Existing ICS defenses often rely on code integrity verification, physical-process-based anomaly detection, or trusted computing technology (such as a TPM); PLC-READER instead requires no additional hardware and is applicable to PLCs from multiple vendors. The solution uses disassembly tools (such as IDA and dnSpy) to analyze PLC engineering software and extract the memory-access function codes of proprietary protocols (such as UMAS, S7COMM, and PCCC), and combines traffic analysis of the interaction between the PLC and its engineering software to locate key memory areas such as variable data, configuration files, application protection passwords, and firmware. Hash checksum comparison is then used to detect and respond to anomalies.
By disassembling engineering files and analyzing ICP traffic, PLC-READER automatically identifies the key memory areas of a PLC and detects memory tampering through hash checks. It requires no hardware support and adapts to multiple vendor platforms; compared with methods that rely on physical modeling or TPM devices, its deployment threshold is lower and its practicality higher, while it maintains a reasonable level of detection accuracy alongside adaptability and versatility. The method still depends, however, on the availability of the engineering software and the accuracy of protocol field reversal: if the manufacturer uses obfuscation, or key logic is not reflected in the engineering files, identification completeness and detection coverage may suffer. Its ability to respond to high-frequency or rapidly changing attacks also remains to be verified. Overall, PLC-READER offers a feasible memory integrity detection path with flexible deployment, and has clear advantages in multi-source information fusion and lightweight defense.
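The hash-comparison step itself is straightforward and can be sketched as follows: take a trusted baseline snapshot of each critical memory region (as read over the engineering protocol), then periodically re-read and compare. The region names are illustrative, not actual PLC-READER identifiers, and the region bytes would in practice come from reverse-engineered memory-read requests.

```python
import hashlib

def snapshot(regions):
    """Hash each critical memory region (name -> bytes dump)."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in regions.items()}

def detect_tampering(baseline, current):
    """Report regions whose current hash differs from the trusted
    baseline (missing regions also count as mismatches)."""
    return [name for name in baseline if baseline[name] != current.get(name)]
```

An unchanged PLC yields an empty report; flipping one output bit in the control logic dump is caught immediately by the hash mismatch.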
4.7.2. Protocol fuzz testing
Protocol fuzz testing aims to uncover vulnerabilities in protocol implementations by generating inputs that are structurally valid but semantically perturbed. Common fuzzing strategies include bit flipping (randomly changing bits to test boundary robustness), edge case testing (injecting maximum, minimum, or null values), state-aware mutation (altering messages based on expected protocol states), and format-preserving fuzzing (modifying specific fields without violating message format). These techniques help simulate potential attack paths such as buffer overflows, illegal transitions, or unexpected system responses.
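The format-preserving strategy above can be sketched as a mutator that rewrites only the bytes of one chosen field with edge-case values, leaving the rest of the message intact so that parsers accept the input while semantic handling is stressed. The field offsets are assumed to come from a prior boundary-division step, and the edge values are a minimal illustrative set.

```python
def format_preserving_mutants(template, field, edge_values=(0x00, 0xFF)):
    """Generate structurally valid mutants of `template`: only the bytes
    of the (start, end) `field` are replaced with repeated edge-case
    values; all other fields, including headers, are untouched."""
    start, end = field
    mutants = []
    for v in edge_values:
        m = bytearray(template)
        m[start:end] = bytes([v]) * (end - start)
        mutants.append(bytes(m))
    return mutants
```

Fuzzing the payload field of a toy message yields all-zero and all-0xFF variants whose headers still parse, probing minimum/maximum boundary handling in the target.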
Applying this principle of generating structurally valid but semantically perturbed messages, Jean-Baptiste Bédrune et al. reverse engineered the private communication protocol of a mainstream SCADA system in depth, extracted the protocol construction logic and communication fields from its DLL module, and combined GUI-triggered mutations to generate perturbed protocol packets [47]. Using the SCADA protocol structure obtained through reversal, they performed protocol-level anomaly construction and behavior testing to identify vulnerabilities and boundary-handling defects in the implementation, in particular how the system handles requests that are structurally legal but semantically abnormal. The approach emphasizes combining structural legitimacy with control-field mutation, and demonstrates the value of protocol reversal in structure-aware fuzz testing.
4.7.3. Protocol reverse protection
Protocol reverse protection is a research direction that has emerged in recent years in response to increasingly prominent industrial control security issues. Its goal is not to parse unknown protocols but to prevent private protocols from being reversed and restored by attackers, for security or confidentiality reasons. Protection is pursued from multiple angles, including system design, data-layer disguise, and control program obfuscation, to preserve the confidentiality and integrity of the protocol structure, control logic, and sensitive industrial data, raising the threshold and cost of protocol reversal, and even steering reversal attempts toward wrong conclusions, without burdening the protocol itself.
In this context, the ObfCP framework proposed by Shalini Banerjee et al. is a representative solution focused on hindering semantic extraction from control programs [48]. ObfCP aims to prevent attackers who intercept a copy of the control program downloaded to the PLC from extracting process semantics (as, for example, the Reditus framework discussed in the data acquisition and preprocessing classification does) through encryption and obfuscation. Assuming an adversary with man-at-the-end (MATE) capabilities, the method defines an abstract form of the control program's assets and inserts inductive logic and dead code, for instance converting the original control logic expression into an abstract version with low semantic transparency, and proposes an encrypted deployment scheme supported by a trusted execution environment to hinder offline analysis, comprehensively raising the difficulty and cost of protocol reversal.
By introducing semantic obfuscation and encrypted deployment for control programs, ObfCP effectively increases the difficulty of reverse engineering industrial control programs during transmission and static analysis. Inserting indirect logic and redundant dead code with low semantic transparency weakens the analyzability of the control logic, and running key logic segments encrypted inside a trusted execution environment (TEE) further raises the bar for offline analysis. However, the obfuscation strategy may still expose semantic features to advanced reverse engineering techniques such as dynamic symbolic execution, and the added complex logic and encryption/decryption operations may affect PLC runtime performance, posing deployment challenges especially on resource-constrained edge control devices. Moreover, the protection granularity is concentrated at the control program level and does not yet cover the protocol communication structure or the data field level, so ObfCP alone cannot form a comprehensive protocol reverse protection system.
To fill this gap, Arvind Sundaram et al. proposed DIOD, which extends the scope of protection to industrial data used in artificial intelligence/machine learning environments [49]. By converting metadata structures into abstract formats within common templates, DIOD aims to reduce the risk of reverse engineering by downstream learning systems. Together, these works embody a multi-layered defense against protocol reversal, covering both logic-level and data-level obfuscation to keep sensitive industrial data from being reversed in AI and machine learning applications. DIOD divides data into "basic metadata" (such as system identity and location information) and "inference metadata" (such as control features and operating mode), decouples and disguises the former, and projects the real data onto a generic system template, hiding implementation details while retaining analytical utility and thereby blocking protocol reversal based on semantic recognition. A semantics-preserving replacement mechanism at the data structure level balances privacy protection against downstream usability, and the approach is validated on multiple industrial datasets. Experimental results show that DIOD significantly reduces data reversibility while still supporting AI training and analysis and protecting sensitive system information. It demonstrates an effective "usable but unrecognizable" data protection strategy for data sharing and model training scenarios, and adds a new technical dimension to ICP reverse protection.
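The decoupling idea can be illustrated with a simplified sketch: identity-revealing "basic" keys are projected onto a generic template, while analysis-relevant "inference" values pass through unchanged. The key names, template values, and record shape below are all hypothetical; DIOD's actual mechanism operates on richer metadata structures.

```python
def disguise_record(record, basic_keys, template):
    """Project a data record onto a generic template: values under
    `basic_keys` (identity-revealing metadata) are replaced by template
    placeholders, while inference metadata is kept for downstream
    analysis or model training."""
    disguised = {}
    for key, value in record.items():
        disguised[key] = template.get(key, value) if key in basic_keys else value
    return disguised
```

A disguised record keeps its process features (flow rate, valve state) usable for learning while no longer identifying the originating site.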
5. Future work
Based on the analysis above, we believe current research on ICP reverse engineering faces several key limitations. Future work should focus on overcoming these issues in order to advance the practical value of the field.
5.1. Fragmented evaluation system
Evaluation-system fragmentation refers to the fact that different studies targeting similar or identical protocol reverse engineering goals use different evaluation metrics and test datasets in their experiments. Some of these datasets are simulated in experimental environments, some come from real traffic, and some are not even public and are difficult to reproduce. Different types of reverse engineering methods perform differently on datasets with different characteristics, which undermines the objectivity of effect-comparison experiments between methods.
For example, the dataset used in the article A Practical Format and Semantic Reverse Analysis Approach for Industrial Control Protocols to compare its results with other methods is not public and can only be obtained by contacting the authors [77]. Reported metrics such as field-segmentation accuracy and semantic-inference accuracy are therefore difficult to verify externally. Table 4 summarizes the dataset availability of the methods introduced in this paper.
Table 4. Summary of datasets accessible in existing tools
In the "Dataset source" column of Table 4, "N" means the evaluation data come from a publicly published online dataset, and "E" means the data come from the authors' own experiments or from real-world measurements. "Whether dataset can be obtained" indicates whether the literature provides a precise, public way to obtain the dataset, such as a specific URL.
It can be seen that only 14% of the surveyed works use public Internet datasets, while 58% of the authors do not provide an easily accessible dataset source. This lack of openness hinders reproducibility and cross-comparison in ICP reverse engineering. Without standardized or accessible datasets, it is difficult to verify the claimed performance of different methods, reproduce experimental environments, or conduct fair benchmarking across tools. Moreover, the absence of a shared evaluation corpus hinders the accumulation of empirical knowledge and limits the transfer of research results to real-world scenarios. To enable credible and sustainable progress, the field needs open, well-documented, and representative datasets for benchmarking protocol reverse-engineering techniques under diverse industrial conditions. Building a standardized, reproducible evaluation benchmark is therefore one of the key issues that ICP reverse-engineering research urgently needs to address.
The same problem appears in the choice of evaluation metrics. Even different reverse-engineering methods targeting the same object may, in their experimental evaluation, adopt different granularities and criteria for different characteristics, especially in heuristic-based reverse-engineering frameworks.
For example, when evaluating message-clustering capability, the article Clustering method in protocol reverse engineering for industrial protocols uses "conciseness" and "coverage", while the article Density peak-based clustering of industrial control protocols for reverse engineering uses "purity" and "F-score" [58, 59]. Table 5 summarizes the evaluation metrics used by the methods introduced in this paper.
Table 5. Summary of evaluation indicators
From the above analysis, Precision, Recall, and F1 score are the most commonly used metrics in existing work, but overall the metrics remain too diverse. Some studies use subjective or highly customized indicators, which further aggravates the fragmentation of the evaluation system, and the lack of a unified metric set hampers horizontal comparison between methods and the accumulation of techniques. Different metrics also reflect different optimization goals and should be chosen according to the characteristics and priorities of the specific reverse-engineering task. When the cost of false positives is high (such as incorrectly labeling control fields that trigger critical operations), Precision should be emphasized to ensure the correctness of the identified fields. In contrast, Recall matters more in tasks where missing key fields or behaviors may lead to incomplete protocol recovery or missed vulnerabilities. The F1 score, as the harmonic mean of Precision and Recall, suits scenarios that must balance false positives against false negatives. Future research should therefore not only build standard evaluation systems and metric-implementation guidelines but also provide context-aware justification for metric selection, helping move reverse-engineering solutions from the "experimental environment" to "engineering application". Building a unified, reproducible, and task-oriented evaluation metric system is an important direction for standardizing the field and improving research comparability.
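As a concrete illustration of how these three metrics trade off, consider key-field identification, where a tool predicts field start offsets in a message. The ground-truth and predicted offsets below are hypothetical:

```python
# Worked example: Precision / Recall / F1 for key-field identification.
# Ground-truth and predicted field offsets are hypothetical illustration data.

true_fields = {0, 2, 4, 7, 9}        # byte offsets where fields actually start
pred_fields = {0, 2, 4, 8, 9, 11}    # offsets reported by a reverse-engineering tool

tp = len(true_fields & pred_fields)             # correctly identified boundaries
precision = tp / len(pred_fields)               # penalizes false positives
recall = tp / len(true_fields)                  # penalizes missed fields
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.80 F1=0.73
```

Here the tool finds most true boundaries (high Recall) but reports two spurious ones, which lowers Precision; F1 summarizes the balance.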
5.2. Alleviating strong assumptions
Existing methods usually start ICP reverse engineering from strong assumptions about data-source acquisition: that the control messages between PLC/RTU firmware and its host computer have already been captured; that the management plane of intranet devices such as SCADA servers and PLCs is already accessible; that binary files and operation records can be extracted from actual actuators and controllers; or that the target device's ports can be freely accessed for active interaction (even when the probing messages are frequent and do not conform to the protocol standard).
These conditions may hold in controlled simulation environments and during engineering debugging or maintenance, but they are difficult to satisfy in industrial systems that are actually deployed and running (especially high-security devices using closed-source protocols) or in reverse-engineering scenarios from an attacker's perspective; this is particularly true of methods based on active interaction and firmware extraction. In real-world environments, especially those involving critical infrastructure or high-security industrial sectors, direct access to firmware or memory and unrestricted interaction with field devices are typically prohibited by strict physical security measures, network segmentation, and access-control policies. Moreover, many production ICS networks operate under minimal-disruption principles, where any unauthorized probing or active interaction is considered a risk to system stability and safety. These factors render many data-acquisition assumptions unrealistic outside controlled laboratory settings or specific maintenance windows. The practical applicability of such reverse-engineering methods is therefore limited by real-world operational constraints: there is a wide gap between data sources that are "presumed available" and those that are "actually obtainable".
Assumptions such as full access to firmware, unrestricted control over network communications, or precise timing of device interactions often simplify the problem space and promote algorithmic innovation, but they severely limit the applicability of proposed methods in real-world industrial control environments, where access rights, system heterogeneity, and operational constraints impose strict restrictions. Therefore, future research should aim to design methods that are both practical and theoretically rigorous, and explicitly consider partial observability, noisy data, and limited interaction permissions. This can be achieved through lightweight passive inference, hybrid learning models that fuse limited supervision and prior knowledge, and scenario-aware evaluation benchmarks that reflect real-world deployment conditions. These efforts will help ensure that protocol reverse engineering techniques remain effective in research and applicable in practice.
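One form of the lightweight passive inference mentioned above is to examine per-offset byte variability across captured messages: offsets whose value never changes suggest fixed header fields, while highly variable offsets suggest data or payload. A minimal sketch, with hypothetical messages standing in for a real passive capture:

```python
# Minimal passive inference sketch: per-offset byte variability across
# captured messages of equal length. Constant offsets suggest fixed header
# fields; variable offsets suggest data/payload. The messages below are
# hypothetical; real input would come from a passive network tap (e.g. pcap).

msgs = [bytes.fromhex(h) for h in
        ("7e0103000a", "7e01040014", "7e0105001e")]

n = len(msgs[0])
distinct = [len({m[i] for m in msgs}) for i in range(n)]  # distinct values per offset

labels = ["const" if d == 1 else "variable" for d in distinct]
print(list(zip(range(n), labels)))
```

This requires no device interaction at all, at the cost of coarser results than active probing or firmware analysis would give.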
5.3. Enhancing semantic analysis
Here, "surface structure" refers to the fact that existing protocol reverse-engineering methods usually stop at dividing fields and assigning roles such as "length field", "address field", "function-code field", or "data field". "Deep semantics" refers to the fact that fields in proprietary industrial control protocols often carry far more specific meanings.
For example, the DNP3 protocol defines a large number of structured data objects in its object library. A group value of 30 indicates an analog input object, and variation 1 indicates a 32-bit integer value accompanied by quality flags. In this context, the "Value" field actually represents analog measurement data (such as voltage, current, or pressure), while the "Quality" field reflects the trustworthiness of the data in the physical world (such as whether the device is online or whether the value is out of range). Such fine-grained field semantics cannot be captured by a single reverse-engineered label like "data field".
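This kind of fine-grained semantics is, in effect, a lookup from (group, variation) to meaning. A small sketch of such a mapping for the Group 30 (Analog Input) entries of the DNP3 object library; only a tiny excerpt is shown, and a full decoder would cover the entire library:

```python
# Hedged sketch: mapping DNP3 (group, variation) pairs to semantic labels,
# following the Group 30 (Analog Input) entries of the DNP3 object library.
# Only a small excerpt is shown for illustration.

DNP3_OBJECTS = {
    (30, 1): "Analog Input, 32-bit with flag",
    (30, 2): "Analog Input, 16-bit with flag",
    (30, 3): "Analog Input, 32-bit without flag",
    (30, 4): "Analog Input, 16-bit without flag",
}

def describe(group: int, variation: int) -> str:
    """Return the semantic label for an object header, or a placeholder."""
    return DNP3_OBJECTS.get((group, variation), f"unknown object g{group}v{variation}")

print(describe(30, 1))  # Analog Input, 32-bit with flag
```

For an open protocol like DNP3 this table is published; for a proprietary ICP, recovering an equivalent table is precisely the deep-semantics problem that current reverse-engineering methods leave open.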
In existing ICP reverse-engineering research, although many methods have made substantial progress in field division and semantic classification, they often lack the ability to deeply understand and model the real semantics of fields in specific protocol scenarios. As a result, protocol reverse engineering cannot interpret semantics in detailed scenarios, and it is difficult to answer key questions such as:
“What kind of industrial object does this field control?”
“What does the value of 100 mean? Is it voltage, current, or other values?”
The lack of deep semantic reverse-engineering research has, to some extent, limited progress on higher-level tasks such as protocol simulation, anomaly detection, and protocol behavior prediction.
In summary, future work should move beyond simple structural recovery toward scenario-aware deep semantic reasoning, especially regarding the interactive context between fields, so as to achieve a leap in protocol-field reversal from "location + type" to "scenario + semantics".
5.4. Integration of tools
Specifically, although many innovative methods have been proposed in recent years for field division, key-field identification, semantic derivation, state-machine modeling, and other aspects of industrial control protocol reverse engineering, these methods often stop at the academic-prototype or algorithm-verification stage, lacking system integration and tool implementations for engineering practice [86]. In other words, "identifying fields and drawing state diagrams" does not mean the protocol reverse-engineering task is truly complete. Further work should explore how inferred protocol structure can be integrated into practical analysis tools and used for traffic decoding, message generation, attack detection, intrusion analysis, and industrial control simulation [87].
In the chapter on protocol reverse-engineering applications, we introduced several representative works that demonstrate strong implementations for specific "application scenarios", but they still do not fully address "reverse-tool platform integration" or "generality and automation". Most of these works are oriented toward specific goals and lack a unified protocol-modeling framework, a standard semantic interface, and a modular integration design; their results are difficult to call, port, or reuse. They remain task-driven tools rather than modular, general-purpose protocol reverse-engineering toolkits.
We believe researchers can draw on the modular, extensible frameworks of STIX and TAXII, which share intelligence in a structured, machine-readable form to achieve interoperability between analysis tools and platforms, and on MISP, which defines standardized schemas for event-based threat-data sharing to promote data consistency, semantic clarity, and tool integration. In the context of ICP reverse engineering, similar efforts could develop standardized intermediate representations and modular analysis interfaces. Such modularity would allow researchers and practitioners to plug different algorithmic modules into a unified framework, share intermediate outputs, and benchmark methods, thereby accelerating the convergence and practical deployment of techniques. From the perspective of integrating reverse-engineering results, future work should focus on building a universal protocol reverse-engineering integration platform that closes the loop from "protocol parsing" to "protocol utilization".
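To make the idea of a standardized intermediate representation concrete, here is a hypothetical JSON-style IR for reversed protocol structure. Every key name below is an assumption for illustration, not an existing standard:

```python
# Hypothetical intermediate representation (IR) for reversed protocol
# structure, inspired by STIX-style structured sharing. All key names
# are illustrative assumptions, not an existing schema.
import json

ir = {
    "protocol": "unknown-icp-01",
    "messages": [{
        "cluster_id": 0,
        "fields": [
            {"offset": 0, "length": 1, "role": "delimiter", "semantics": None},
            {"offset": 1, "length": 1, "role": "function_code",
             "semantics": {"0x03": "read", "0x10": "write"}},
            {"offset": 2, "length": 2, "role": "length", "semantics": None},
        ],
    }],
    "state_machine": {"states": ["idle", "request", "reply"],
                      "transitions": [["idle", "request", "0x03"]]},
}

serialized = json.dumps(ir, indent=2)   # machine-readable exchange format
restored = json.loads(serialized)       # any tool can reload and extend the IR
```

A clustering module could emit only the "messages" part, a state-machine inference module only the "state_machine" part, and downstream consumers (fuzzers, decoders, IDS rules) would read one shared document.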
6. Conclusion
ICP reverse engineering is one of the main safeguards for maintaining the security of ICSs. Through data acquisition, message clustering, field division, key-field identification, semantic derivation, and state-machine modeling, the overall security posture of a system can be improved. This paper systematically summarizes the current mainstream research directions and technical achievements in protocol reverse engineering around the structural characteristics of typical ICPs and the common dimensions of protocol reverse engineering, and discusses their application scenarios and practical value from the perspective of real-world use. However, current research still falls short in semantic deepening, and these bottlenecks restrict, to some extent, the generalization ability and practical performance of reverse-engineering results. Future research should pursue a standardized evaluation system, protocol-structure inference methods for weak-assumption environments, and the deep integration of semantic derivation with specific scenarios, so as to achieve the toolization and integration of reverse-engineering results and push protocol reverse engineering further toward practical application.
Acknowledgments
We thank the anonymous reviewers for their hard work and kind help.
Funding
This work was supported by the Natural Science Foundation of China (no. 62303126, no. 62362008), the Major Scientific and Technological Special Project of Guizhou Province ([2024]014), and the Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China (no. ICT2025B39).
Conflicts of interest
The authors declare no conflicts of interest.
Data availability statement
No data are associated with this article.
Author contribution statement
Yuheng Wu wrote the main part of this paper. Zhenyong Zhang guided the overall direction of the work, revised the manuscript, and provided important suggestions. Zheqiu Hetu contributed supplementary content to this paper. Xinyu Cheng assisted in revising the manuscript and provided constructive feedback. Peng Cheng offered academic guidance for the research framework and gave key suggestions for improving the manuscript.
References
- Cucinotta T, Mancina A and Anastasi GF et al. A real-time service-oriented architecture for industrial automation. IEEE Trans Indust Inf 2009; 5: 267–77. [Google Scholar]
- Liu L, Xu Z and Qu X. A reconfigurable architecture for industrial control systems: Overview and challenges. Machines 2024; 12: 793. [Google Scholar]
- McLaughlin S, Konstantinou C and Wang X et al. The cybersecurity landscape in industrial control systems. Proc IEEE 2016; 104: 1039–57. [Google Scholar]
- Kim KH, Kwak BI and Han ML et al. Intrusion detection and identification using tree-based machine learning algorithms on DCS network in the oil refinery. IEEE Trans Power Syst 2022; 37: 4673–82. [Google Scholar]
- Babayigit B and Abubaker M. Industrial internet of things: A review of improvements over traditional SCADA systems for industrial automation. IEEE Syst J 2024; 18: 120–33. [Google Scholar]
- Zhang Z, Deng R and Tian Y et al. SPMA: stealthy physics-manipulated attack and countermeasures in cyber-physical smart grid. IEEE Trans Inf Forensics Secur 2023; 18: 581–96. [Google Scholar]
- Ike M, Phan K and Sadoski K et al. Scaphy: detecting modern ICS attacks by correlating behaviors in SCADA and physical systems. In: Proc. 2023 IEEE Symposium on Security and Privacy (SP), 2023, 20–37. [Google Scholar]
- Kayan H, Nunes M and Rana O et al. Cybersecurity of industrial cyber-physical systems: A review. ACM Comput Surv 2022; 54: 229. [Google Scholar]
- Galloway B and Hancke GP. Introduction to industrial control networks. IEEE Commun Surv Tutorials 2013; 15: 860–80. [Google Scholar]
- Grand View Research. Industrial Automation and Control Systems Market Size, Share & Trends Analysis Report by Component (Industrial Robots, Control Valves), by Control System (DCS, PLC, SCADA), by End-use, by Region, and Segment Forecasts, 2025–2030. Report ID: GVR-4-68038-130-6, https://www.grandviewresearch.com/, last accessed 29 Mar. 2025. [Google Scholar]
- Wang FY. New control paradigm for industry 5.0: From big models to foundation control and management. IEEE/CAA J Autom Sin 2023; 10: 1643–46. [Google Scholar]
- Sasaki T, Fujita A, Gañán C H, et al. Exposed Infrastructures: Discovery, Attacks and Remediation of Insecure ICS Remote Management Devices. In: Proc. of 2022 IEEE Symposium on Security and Privacy (SP), 2022, 2379–96. [Google Scholar]
- Asghar MR, Hu Q and Zeadally S. Cybersecurity in industrial control systems: Issues, technologies, and challenges. Comput Networks 2019; 165: 106946. [Google Scholar]
- Zhou CJ, Li XH and Yang SH et al. Risk-based security task scheduling in industrial control systems considering safety. IEEE Trans Indust Inf 2020; 16: 3112–23. [Google Scholar]
- Zhang Q, Zhou CJ and Xiong NX et al. Multimodel-based incident prediction and risk assessment in dynamic cybersecurity protection for industrial control systems. IEEE Trans Syst Man Cybern: Syst 2016; 46: 1429–44. [Google Scholar]
- Mohammed AS, Saxena N and Rana O. Wheels on the modbus – attacking ModbusTCP communications. In: Proceedings of the 15th ACM Conference on Security and Privacy in Wireless and Mobile Networks, ACM, 2022, 288–89. [Google Scholar]
- Chan A and Zhou J. Non-intrusive protection for legacy SCADA systems. IEEE Commun Mag 2023: 1–7. [Google Scholar]
- McLaughlin SE. Specification-based attacks and defenses in sequential control systems. Ph.D. Thesis, Pennsylvania State University, 2014. [Google Scholar]
- Rrushi JL. SCADA protocol vulnerabilities. In: Critical Infrastructure Protection: Information Infrastructure Models, Analysis, and Defense. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, 150–76. [Google Scholar]
- Sija BD, Goo YH and Shim KS et al. A survey of automatic protocol reverse engineering approaches, methods, and tools on the inputs and outputs view. Secur Commun Networks 2018; 2018: 8370341. [Google Scholar]
- Narayan J, Shukla SK and Clancy TC. A survey of automatic protocol reverse engineering tools. ACM Comput Surv (CSUR), 2015; 48: 1–26. [Google Scholar]
- Huang Y, Shu H and Kang F et al. Protocol reverse-engineering methods and tools: A survey. Comput Commun 2022; 182: 238–54. [Google Scholar]
- Rosa L, Freitas M and Mazo S et al. A comprehensive security analysis of a SCADA protocol: From OSINT to mitigation. IEEE Access 2019; 7: 42156–42168. [Google Scholar]
- Meng J, Yang ZY and Zhang ZY et al. SePanner: Analyzing semantics of controller variables in industrial control systems based on network traffic. In: Proceedings of the 39th Annual Computer Security Applications Conference, ACM, 2023, 310–23. [Google Scholar]
- Liao GY, Chen YJ and Lu WC et al. Toward authenticating the master in the Modbus protocol. IEEE Trans Power Delivery 2008; 23: 2628–29. [Google Scholar]
- Cervelión Bastidas AJ, Agredo Méndez GL and Revelo-Fuelagán J et al. Performance evaluation of modbus and DNP3 protocols in the communication network of a university campus microgrid. Results Eng 2024; 24: 103656. [Google Scholar]
- Ortiz N, Cardenas AA and Wool A. A taxonomy of industrial control protocols and networks in the power grid. IEEE Commun Mag 2023; 61: 21–7. [Google Scholar]
- Alsabbagh W and Langendörfer P. You are what you attack: Breaking the cryptographically protected S7 protocol. In: 2023 IEEE 19th International Conference on Factory Communication Systems (WFCS), 2023, 1–8. [Google Scholar]
- Kjellsson J, Vallestad AE and Steigmann R et al. Integration of a wireless I/O interface for PROFIBUS and PROFINET for factory automation. IEEE Trans Indust Electron 2009; 56: 4279–87. [Google Scholar]
- Majdalawieh M, Parisi-Presicce F and Wijesekera D. DNPSec: Distributed network protocol version 3 (DNP3) security framework. Advances in Computer, Inf Syst Sci Eng 2006; 1: 227–34. [Google Scholar]
- Clarke G, Reynders D and Wright E. Practical modern SCADA protocols: DNP3, 60870.5 and related systems. Newnes, 2004. [Google Scholar]
- Drias Z, Serhrouchni A and Vogel O. Analysis of cyber security for industrial control systems. In: 2015 International Conference on Cyber Security of Smart Cities, Industrial Control System and Communications (SSIC), IEEE, 2015, 1–8. [Google Scholar]
- Stouffer K, Falco J and Scarfone K. Guide to industrial control systems (ICS) security. NIST Spec Pub 2011; 800: 16–16. [Google Scholar]
- Igure VM, Laughter SA and Williams RD. Security issues in SCADA networks. Comput Secur 2006; 25: 498–506. [Google Scholar]
- Knowles W, Prince D and Hutchison D et al. A survey of cyber security management in industrial control systems. Int J Crit Infrastruct Prot 2015; 9: 52–80. [Google Scholar]
- Cárdenas AA, Amin S and Sastry S. Research challenges for the security of control systems. HotSec 2008; 5: 1158. [Google Scholar]
- Seshadri SS, Rodriguez D and Subedi M et al. Iotcop: A blockchain-based monitoring framework for detection and isolation of malicious devices in Internet-of-Things systems. IEEE Internet Things J 2020; 8: 3346–59. [Google Scholar]
- Cui A and Stolfo SJ. A quantitative analysis of the insecurity of embedded network devices: results of a wide-area scan. In: Proceedings of the 26th Annual Computer Security Applications Conference, ACM, 2010, 97–106. [Google Scholar]
- Cui A, Costello M and Stolfo SJ. When firmware modifications attack: A case study of embedded exploitation. NDSS Symp 2013; 1: 1.1–8.1. [Google Scholar]
- Caballero J, Poosankam P and Kreibich C et al. Dispatcher: Enabling active botnet infiltration using automatic protocol reverse-engineering. In: Proceedings of the 16th ACM Conference on Computer and Communications Security, ACM, 2009, 621–34. [Google Scholar]
- Wei X, Yan Z and Liang X. A survey on fuzz testing technologies for industrial control protocols. J Network Comput Appl 2024. [Google Scholar]
- Konstantinou C and Maniatakos M. Impact of firmware modification attacks on power systems field devices. In: 2015 IEEE International Conference on Smart Grid Communications (SmartGridComm), IEEE, 2015, 283–88. [Google Scholar]
- Kleber S, Maile L and Kargl F. Survey of protocol reverse engineering algorithms: Decomposition of tools for static traffic analysis. IEEE Commun Surv Tutorials 2018; 21: 526–61. [Google Scholar]
- Huang Y, Shu H and Kang F et al. Protocol reverse-engineering methods and tools: A survey. Comput Commun 2022; 182: 238–54. [Google Scholar]
- Lifa W, Chen W and Zheng H et al. Overview on protocol state machine inference: a survey. Appl Res Comput 2015; 32: 1931–1936. [Google Scholar]
- Geng Y, Chen Y and Ma R et al. Defending cyber–physical systems through reverse-engineering-based memory sanity check. IEEE Internet Things J 2022; 10: 8331–47. [Google Scholar]
- Bédrune JB, Gazet A and Monjalet F. Supervising the supervisor: Reversing proprietary SCADA tech. In: Hack In The Box Security Conference, 2015. [Google Scholar]
- Banerjee S, Galbraith SD and Khan T et al. Preventing reverse engineering of control programs in industrial control systems. In: Proceedings of the 9th ACM Cyber-Physical System Security Workshop, ACM, 2023, 48–59. [Google Scholar]
- Sundaram A, Abdel-Khalik HS and Abdo MG. Preventing reverse engineering of critical industrial data with DIOD. Nucl Technol 2023; 209: 37–52. [Google Scholar]
- Nawrocki M, Schmidt TC and Wählisch M. Uncovering vulnerable industrial control systems from the internet core. In: NOMS 2020–2020 IEEE/IFIP Network Operations and Management Symposium, IEEE, 2020, 1–9. [Google Scholar]
- Luo Z, Liang K and Zhao Y et al. DynPRE: Protocol reverse engineering via dynamic inference. In: Network and Distributed System Security Symposium (NDSS), 2024, 1–18. [Google Scholar]
- Keliris A and Maniatakos M. ICSREF: A framework for automated reverse engineering of industrial control systems binaries, arXiv preprint [arXiv: https://arxiv.org/abs/1812.03478], 2018. [Google Scholar]
- Qasim SA, Smith JM and Ahmed I. Control logic forensics framework using built-in decompiler of engineering software in industrial control systems. Forensic Sci Int Digital Invest 2020; 33: 301013. [Google Scholar]
- Geng Y, Che X and Ma R et al. Control logic attack detection and forensics through reverse-engineering and verifying PLC control applications. IEEE Internet Things J 2023; 11: 8386–400. [Google Scholar]
- Zaddach J, Bruno L and Francillon A et al. AVATAR: A framework to support dynamic security analysis of embedded systems’ firmwares. NDSS, 2014; 14: 1–16. [Google Scholar]
- Costin A, Zarras A and Francillon A. Automated dynamic firmware analysis at scale: a case study on embedded web interfaces. In: Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, ACM, 2016, 437–48. [Google Scholar]
- Bossert G, Guihéry F and Hiet G. Towards automated protocol reverse engineering using semantic information. In: Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, ACM, 2014, 51–62. [Google Scholar]
- Shim KS, Goo YH and Lee MS et al. Clustering method in protocol reverse engineering for industrial protocols. Int J Network Manage 2020; 30: e2126. [Google Scholar]
- Tong D and Wang Y. Density peak-based clustering of industrial control protocols for reverse engineering. In: International Conference on Cryptography, Network Security, and Communication Technology (CNSCT 2022), SPIE, vol. 12245, 2022, 64–9. [Google Scholar]
- Ji Y, Huang T and Ma C et al. IMCSA: Providing better sequence alignment space for industrial control protocol reverse engineering. Secur. Commun. Networks, 2022; 2022: 8026280. [Google Scholar]
- Luo X, Chen D and Wang Y et al. A type-aware approach to message clustering for protocol reverse engineering. Sensors, 2019; 19: 716. [Google Scholar]
- Sun Y, Li Z and Lv S et al. Spenny: Extensive ICS protocol reverse analysis via field guided symbolic execution. IEEE Trans Dependable Secure Comput 2022; 20: 4502–18. [Google Scholar]
- Beddoe MA. Network protocol analysis using bioinformatics algorithms. Toorcon, 2004; 26: 1095–98. [Google Scholar]
- Liu O, Zheng B and Sun W et al. A data-driven approach for reverse engineering electric power protocols. J Signal Process Syst 2021; 93: 769–77. [Google Scholar]
- Liu Y, Zhang F and Ding Y et al. Sub-messages extraction for industrial control protocol reverse engineering. Comput Commun 2022; 194: 1–14. [Google Scholar]
- Comparetti PM, Wondracek G and Kruegel C et al. Prospex: Protocol specification extraction. In: 2009 30th IEEE Symposium on Security and Privacy, IEEE, 2009, 110–25. [Google Scholar]
- Caballero J, Yin H and Liang Z et al. Polyglot: Automatic extraction of protocol message format using dynamic binary analysis. In: Proceedings of the 14th ACM Conference on Computer and Communications Security, ACM, 2007, 317–29. [Google Scholar]
- Lin Z, Jiang X and Xu D et al. Automatic protocol format reverse engineering through context-aware monitored execution. In: NDSS, vol. 8, 2008, 1–15. [Google Scholar]
- Ma R, Zheng H and Wang J et al. Automatic protocol reverse engineering for industrial control systems with dynamic taint analysis. Front Inf Technol Electron Eng 2022; 23: 351–60. [Google Scholar]
- https://doczz.net/doc/4115870/d4.6-protocol-learning-for-ami-environments, last accessed 6 Jun. 2025. [Google Scholar]
- Wang X, Lv K and Li B. IPART: an automatic protocol reverse engineering tool based on global voting expert for industrial protocols. Int J Parallel Emergent Distrib Syst 2020; 35: 376–95. [Google Scholar]
- Bermudez I, Tongaonkar A and Iliofotou M et al. Towards automatic protocol field inference. Comput Commun 2016; 84: 40–51. [Google Scholar]
- Cui W, Kannan J and Wang HJ. Discoverer: Automatic Protocol Reverse Engineering from Network Traces. In: USENIX Security Symposium, 2007, 1–14. [Google Scholar]
- Wang Z, Jiang X and Cui W et al. Reformat: Automatic reverse engineering of encrypted messages. In: Computer Security–ESORICS 2009: 14th European Symposium on Research in Computer Security, Springer Berlin Heidelberg, 2009, 200–15. [Google Scholar]
- Ye Y, Zhang Z and Wang F et al. NetPlier: Probabilistic Network Protocol Reverse Engineering from Message Traces. In: Network and Distributed System Security Symposium (NDSS), 2021. [Google Scholar]
- Qin Z, Yang Z and Geng Y et al. Reverse Engineering Industrial Protocols Driven By Control Fields. In: IEEE INFOCOM 2024–IEEE Conference on Computer Communications, IEEE, 2024: 2408–17. [Google Scholar]
- Wang Q, Sun Z and Wang Z et al. A practical format and semantic reverse analysis approach for industrial control protocols. Secur Commun Networks 2021; 2021: 6690988. [Google Scholar]
- Qasim SA, Jo W and Ahmed I. Pree: Heuristic builder for reverse engineering of network protocols in industrial control systems. Forensic Sci Int Digital Invest 2023; 45: 301565. [Google Scholar]
- Yang Z, He L and Ruan Y et al. Unveiling Physical Semantics of PLC Variables Using Control Invariants. IEEE Trans Dependable Secure Comput, 2024. [Google Scholar]
- Hetu Z, Zhang Z and Wang M et al. CASI: Context-aware Automatic Semantic Inference by fusing video and network traffic information in industrial control systems. Inf Fusion 2025; 122: 103174. [Google Scholar]
- Ning B, Zong X and He K et al. PREIUD: An industrial control protocols reverse engineering tool based on unsupervised learning and deep neural network methods. Symmetry, 2023; 15: 706. [Google Scholar]
- Prähofer H, Wirth C and Berger R. Reverse engineering and visualization of the reactive behavior of PLC applications. In: 2013 11th IEEE International Conference on Industrial Informatics (INDIN), IEEE, 2013, 564–71. [Google Scholar]
- Stefanidis K and Voyiatzis AG. An HMM-based anomaly detection approach for SCADA systems. In: Information Security Theory and Practice: 10th IFIP WG 11.2 International Conference, WISTP 2016, Heraklion, Crete, Greece: Springer International Publishing, 2016, 85–99. [Google Scholar]
- De Ruiter J and Poll E. Protocol state fuzzing of TLS implementations. In: 24th USENIX Security Symposium (USENIX Security 15), 2015, 193–206. [Google Scholar]
- Alladi T, Chamola V and Zeadally S. Industrial control systems: Cyberattack trends and countermeasures. Comput Commun 2020; 155: 1–8. [Google Scholar]
- Duchêne J, Le Guernic C and Alata E et al. State of the art of network protocol reverse engineering tools. J Comput Virol Hacking Tech 2018; 14: 53–68. [Google Scholar]
- Hu Y, Sun Y and Wang Y et al. An enhanced multi-stage semantic attack against industrial control systems. IEEE Access, 2019; 7: 156871–882. [Google Scholar]

Yuheng Wu is currently studying at the State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, China. His research interests include network security and protocol reverse engineering.

Zhenyong Zhang is a distinguished professor at Guizhou University, China. He received his bachelor's degree in Automation from Central South University and his Ph.D. degree in Control Science and Engineering from Zhejiang University, China. He has made outstanding contributions in the field of computer science and cybersecurity.

Zheqiu Hetu received the bachelor’s degree from Southwest Minzu University, China, in 2021. He is currently studying in the College of Computer Science and Technology, Guizhou University, China. His research interests include industrial control system security and protocol security.

Xinyu Cheng is currently a professor from Guizhou University, China. His research interests include artificial intelligence, computer vision, and software engineering.

Peng Cheng is currently a professor and the executive vice dean of the School of Control Science and Engineering at Zhejiang University, China. His research interests include industrial internet system security, Internet of Things, cyber-physical fusion systems, and data security and privacy protection.
All Figures

Figure 1. DNP3 message structure
Figure 2. Link control structure
Figure 3. Overview of ICP reverse engineering
Figure 4. The framework for ICP reverse engineering