# MICROPROCESSORS KOMDIV FOR HIGH-PERFORMANCE EMBEDDED SYSTEMS

S.G. Bobkov

Federal State Institution «Scientific Research Institute for System Analysis of the Russian Academy of Sciences» (SRISA), Moscow, Russian Federation, bobkov@cs.niisi.ras.ru

Electronics Department, National Research Nuclear University «MEPhI», Moscow, Russian Federation

Abstract—The problems of creating high-performance embedded computing systems based on KOMDIV microprocessors are considered. Processor performance depends on three characteristics: clock cycle time, clock cycles per instruction, and instruction count. For KOMDIV microprocessors these characteristics are optimized with respect to the performance-to-power-consumption ratio and the requirements of embedded systems.

Keywords—trusted systems; system on chip; microprocessor architecture; co-processor

## I. INTRODUCTION

There is a steady trend toward more intelligent industrial management and monitoring systems. As technological processes advance, such systems grow more complex, and digital signal processing becomes the main task in many complex technological processes. To keep computation time acceptable, some monitoring and control tasks are assigned to specialized processors. The complexity of technological processes and of processing algorithms complicates both the computing elements and their programming, and for some tasks the complexity of the software greatly exceeds that of the hardware engineering. Programming in high-level languages is essential for solving such tasks.

Thus, there is a need for computing units for monitoring and management systems that combine high efficiency and ease of programming with strict requirements on operating modes and power consumption.
As a result, control and management tasks are assigned more often to general-purpose microprocessors than to specialized processors with built-in dedicated computing units aimed at a particular class of tasks. A vivid example of this trend is the development of digital signal processing systems. Today developers of new systems use one of the two most popular kinds of digital processors: ADSP microprocessors from Analog Devices, or universal microprocessors with built-in coprocessors based on AltiVec<sup>TM</sup> technology. AltiVec<sup>TM</sup> is Freescale's trademark for the first PowerPC SIMD extension. The first kind follows the traditional DSP processor approach. Microprocessors of the second kind were jointly developed by Motorola, IBM, and Apple.

## II. MICROPROCESSOR KOMDIV

KOMDIV is a SRISA trademark. This article considers the architecture of the KOMDIV microprocessor, which belongs to the class of general-purpose microprocessors with additional built-in functions for digital signal processing. SRISA covers the full cycle of ASIC design:

- Design and production of ICs, down to 28 nm;
- System software: compilers, real-time OS, debuggers, etc.;
- Modules and systems;
- In-house research manufacturing fab: 0.5–0.25 μm bulk and SOI.

SRISA has designed a number of microprocessors (Fig. 1).

ISSN (Print): 2204-0595 ISSN (Online): 2203-1731

Fig. 1. SRISA microprocessors line.

One of the highest-performance microprocessors, the 1890VM118, consists of two CPU cores and a system controller, which makes it possible to assemble a chip with the necessary set of interfaces (Fig. 2).

Fig. 2. The block diagram of the microprocessor 1890VM118.

This microprocessor is close in concept to microprocessors with the AltiVec processing technique. It combines the functions of a high-speed general-purpose microprocessor and a DSP processor and can be used as a complete system for high-performance embedded applications.
From the standpoint of a high-speed general-purpose microprocessor, this microprocessor has the following architectural features:

- Two 64-bit KOMDIV cores with support for a 32-bit mode;
- Support for Symmetric Multiprocessing (SMP) and Asymmetric Multiprocessing (AMP) modes;
- 1600 MFLOPS, double precision (for FPU);
- 1000 Dhrystone 2.1 MIPS;
- System controller:
- Two DDR3/DDR3L SDRAM controllers;
- System switch;
- 3D graphics core: pixel rate 2.8 Gp/s, 250 GFLOPS, OpenGL ES 3.0, OpenGL Desktop 2.1, OpenCL 1.1, MJPEG, MJPEG2K;
- Four PCI Express 2.0 (three Root Complex, one Dual Mode (RC+EP)), 4 x1; 2 x2 or 2 x4;
- Two Ethernet 10/100/1000;
- Two SATA 3.0;
- Audio;
- Three (EHCI+UHCI) USB 2.0;
- Two UART;
- Four SPI;
- Two I2C;
- SMB controller;
- Two CAN 2.0;
- Thirty-two GPIO;
- Four DMA (chain mode, single shot, etc.);
- EJTAG.

The microprocessor core has the following architectural features (Fig. 3):

- 1.3 GHz for the temperature range from 50 °C to +85 °C;
- Big Endian and Little Endian support;
- Superscalar mode: simultaneous execution of two integer operations, a Load/Store instruction and floating-point instructions;
- A 7-stage pipeline with prefetch and reordering;
- Separate L1 instruction and data caches, 32 KB each, 8-way, DICE memory cells;
- A built-in 512 KB L2 cache (with single error correction and double error detection), 4-way;
- Three separate multiport register files;
- Three privilege modes: user, supervisor, kernel;
- An associative translation lookaside buffer for virtual addresses (joint TLB) with 64 addresses;
- Separate instruction and data cache translation lookaside buffers for virtual addresses (microTLB) with four addresses each.
In addition, the following circuit-level and layout decisions were made to increase the clock rate:

- The design is divided into separate blocks of different performance;
- Different engineering approaches are used for different types of blocks;
- The most speed-critical blocks are developed at the transistor level, using dynamic logic for memory;
- Additional library elements were developed and added to the existing element library to increase the speed of the design;
- Manual placement is used for the most speed-critical blocks;
- Manual routing is used for the most speed-critical blocks.

## III. VECTOR CO-PROCESSOR

For embedded applications, a vector coprocessor has been implemented [2, 3]. The vector coprocessor is designed to extend the functionality of a universal microprocessor in digital processing tasks. It performs arithmetic operations on complex and vector data types represented by single- and double-precision floating-point numbers. The main characteristics of the coprocessor are:

- support for complex and vector data types represented by single- and double-precision floating-point numbers;
- a maximum vector width of 128 bits;
- a register file with 64 128-bit registers;
- up to 10 arithmetic operations on double-precision real numbers and up to 20 arithmetic operations on single-precision real numbers per step (for the instruction that multiplies complex numbers with accumulation and subtraction);
- an advanced set of vector instructions;
- the ability to load/store vectors through the L1 cache or a pair of vectors through the L2 cache.

The coprocessor supports the following data formats:

- one double-precision complex number;
- two double-precision floating-point numbers;
- two single-precision complex numbers;
- four single-precision real numbers;
- one single-precision complex number;
- two single-precision floating-point numbers.
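The per-step operation counts quoted among the coprocessor characteristics (10 in double precision, 20 in single precision) can be checked with a simple count. The sketch below is only an illustration, not a description of the hardware. It assumes that the instruction in question computes both A + B×W and A − B×W for complex operands, and that a 128-bit register holds one double-precision or two single-precision complex numbers, as in the format list above. The function names are hypothetical.

```python
REGISTER_BITS = 128  # maximum vector width stated in the text


def complex_mac_op_count():
    """Real operations in one A +/- B*W step on complex operands (assumed decomposition)."""
    multiplies = 4     # Br*Wr, Bi*Wi, Br*Wi, Bi*Wr
    product_adds = 2   # one subtraction for Re(B*W), one addition for Im(B*W)
    sum_diff_adds = 4  # Re and Im of A + B*W, Re and Im of A - B*W
    return multiplies + product_adds + sum_diff_adds


def ops_per_step(component_bits):
    """Real operations per step when a full register is packed with complex numbers."""
    complex_bits = 2 * component_bits             # real part + imaginary part
    complex_per_register = REGISTER_BITS // complex_bits
    return complex_per_register * complex_mac_op_count()


print("double precision:", ops_per_step(64))
print("single precision:", ops_per_step(32))
```

Under these assumptions the count reproduces the two figures given in the characteristics list.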
Introducing the coprocessor opened up a number of opportunities for digital processing tasks. One of the basic operations in digital signal processing is the Fourier transform applied to a data stream. Therefore, to increase the performance of the microprocessor it is appropriate to carry out several Fourier transforms simultaneously. The parallel execution of four transforms is considered below.

The fast Fourier transform algorithm with decimation in time reduces to evaluating expressions of the form

$$\begin{pmatrix} A \\ B \end{pmatrix} = \begin{pmatrix} A + B \times W \\ A - B \times W \end{pmatrix},$$

where $A$, $B$ and $W$ are complex numbers. This operation is usually referred to as a butterfly. Let us denote $A_r = \text{Re } A$, $A_i = \text{Im } A$, $B_r = \text{Re } B$, etc. The Fourier butterfly is then written as

$$\begin{pmatrix} A_r & A_i \\ B_r & B_i \end{pmatrix} = \begin{pmatrix} A_r + (B_r \times W_r - B_i \times W_i) & A_i + (B_r \times W_i + B_i \times W_r) \\ A_r - (B_r \times W_r - B_i \times W_i) & A_i - (B_r \times W_i + B_i \times W_r) \end{pmatrix}$$

It is assumed that all data for the four transforms reside in the L1 cache memory. The processor simultaneously carries out four different Fourier transforms on 32-bit numbers. The data are stored in the cache memory as follows:

$$A(1)^1, A(1)^2, A(1)^3, A(1)^4, A(2)^1, A(2)^2, A(2)^3, A(2)^4, A(3)^1, \ldots,$$

where $A(i)^j$ is the $i$-th element of the array $A$ for transform number $j$. After a Fourier butterfly is executed, the results are written to the same locations as the initial data.
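The butterfly and the interleaved storage order described above can be illustrated with a short Python sketch. It is a plain scalar model, not the coprocessor code: the names `butterfly4` and `LANES` are made up for this example, indices are 0-based, real and imaginary parts are kept in separate arrays, and one twiddle factor W is assumed to be shared by all four transforms.

```python
LANES = 4  # four transforms processed side by side, as in the text


def butterfly4(re, im, ia, ib, wr, wi):
    """In-place radix-2 butterfly on elements ia and ib of four interleaved transforms.

    re, im: flat lists of real and imaginary parts; element i of transform j
            is stored at index LANES * i + j (0-based form of the layout above).
    wr, wi: real and imaginary parts of the twiddle factor W
            (assumption: the same W is used for all four transforms).
    """
    for j in range(LANES):
        a = LANES * ia + j
        b = LANES * ib + j
        # C = B * W, expanded into real arithmetic
        cr = re[b] * wr - im[b] * wi
        ci = re[b] * wi + im[b] * wr
        ar, ai = re[a], im[a]
        # Results overwrite the inputs: A <- A + C, B <- A - C
        re[a], im[a] = ar + cr, ai + ci
        re[b], im[b] = ar - cr, ai - ci
```

Calling `butterfly4(re, im, 0, 1, wr, wi)` on two eight-element lists updates elements 0 and 1 of all four transforms at once. Each pass of the inner loop performs 10 real operations, so one call performs 40.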
The calculations of the transform can be organized as follows (all data for the current iteration are loaded; $C$ denotes the product $B \times W$):

1. $R_{20} \to \text{Mem}(B_i^{k-4}, B_i^{k-3})$; $R_{00} * R_{02} \to R_{10}$ $(B_i * W_i)$; $R_{21} \to \text{Mem}(B_i^{k-2}, B_i^{k-1})$;
2. $R_{22} \to \text{Mem}(B_r^{k-4}, B_r^{k-3})$; $R_{01} * R_{02} \to R_{11}$ $(B_i * W_i)$; $R_{23} \to \text{Mem}(B_r^{k-2}, B_r^{k-1})$;
3. $A_i^{k+4}, A_i^{k+5} \to R_{24}$; $A_i^{k+6}, A_i^{k+7} \to R_{25}$;
4. $A_r^{k+4}, A_r^{k+5} \to R_{26}$; $A_r^{k+6}, A_r^{k+7} \to R_{27}$;
5. $B_r^{k+4}, B_r^{k+5} \to R_{28}$; $B_r^{k+6}, B_r^{k+7} \to R_{29}$;
6. $R_{01} * R_{03} \to R_{11}$ $(B_i * W_r)$;
7. $R_{04} * R_{02} - R_{10} \to R_{14}$;
8. $R_{05} * R_{02} - R_{11} \to R_{15}$;
9. $W_i, W_i \to R_{02}$; $R_{06} \pm R_{12} \to R_{16}, R_{17}$ $(A_i \pm C_i)$; $W_r, W_r \to R_{03}$;
10. $B_i^{k+4}, B_i^{k+5} \to R_{00}$; $R_{07} \pm R_{13} \to R_{18}, R_{19}$ $(A_i \pm C_i)$; $B_i^{k+6}, B_i^{k+7} \to R_{01}$;
11. $R_{16} \to \text{Mem}(A_i^{k}, A_i^{k+1})$; $R_{08} \pm R_{12} \to R_{20}, R_{21}$ $(A_r \pm C_r)$; $R_{17} \to \text{Mem}(A_i^{k+2}, A_i^{k+3})$;
12. $R_{18} \to \text{Mem}(A_r^{k}, A_r^{k+1})$; $R_{09} \pm R_{13} \to R_{22}, R_{23}$ $(A_r \pm C_r)$; $R_{19} \to \text{Mem}(A_r^{k+2}, A_r^{k+3})$.

Fig. 3. The block diagram of the microprocessor core.

As a result, four Fourier butterflies are carried out in 40 operations, with up to 10 single-precision floating-point operations executed per cycle ($a^{1} * b^{1} \pm c^{1}$; $a^{2} * b^{2} \pm c^{2}$).

## IV. CONCLUSION

Thus, it proved possible to create a general-purpose microprocessor with built-in functions for digital signal processing.
Unlike traditional digital signal processors, 1890VM118 microprocessors can be programmed effectively in the high-level C language. A substantial performance increase on specific tasks can be achieved by introducing small changes to general-purpose microprocessors, and some digital signal processing functions have been included in the 1890VM118 microprocessor in this way. The research has shown that similar improvements can be achieved for other tasks by corresponding development of the microprocessor. This development process is called cross optimization.

#### REFERENCES

- [1] Bobkov S. G. Import Substitution of the Circuitry of Computing Systems // Herald of the Russian Academy of Sciences, 2014, vol. 84, no. 11, pp. 1010–1016.
- [2] Bobkov S. G., Aryashev S. I., Barskyh M. E., Zubkovskiy P. S., Ivasyuk E. V. High-Performance Extensions of Microprocessor Architecture for Speeding-Up of Scientific and Engineering Calculations // Informacionnye tehnologii, 2014, no. 6, pp. 27–37 (in Russian).
- [3] Aryashev S. I., Bobkov S. G., Zubkovskiy P. S., Ivasyuk E. V. Development of Compensated Addition Hardware Module to Improve Calculation Accuracy // Informacionnye tehnologii, 2015, vol. 21, no. 8, pp. 570–575 (in Russian).