1. Approach
For programming on a PC platform, floating point arithmetic is usually used. The decision is only between single (float) and double precision, and the presence of a double precision arithmetic processor suggests using double everywhere. This topic is not discussed in depth in this article, only a simple hint with a link:
For good performance in C++, use float in all cases unless you actually need double.
The topic of this article is: using fixed point instead of floating point.
1.1. Three reasons to use fixed point even in a floating point environment

1) The physical world is a fixed point world. Measurement values are limited in range and resolution. For example, an electrical current may not be greater than 1000 A, otherwise the hardware is damaged. If 1000 A can be measured, a meaningful resolution may be 0.01 A, no finer.

2) Small processors (chosen for low price or low power consumption) often have no floating point arithmetic on chip. It is possible to calculate floating point in software, but it costs time and memory.

3) Processors for embedded systems often do not support double arithmetic in hardware; they support only single precision, float. But a float number has only a 24 bit mantissa. This causes a problem for integrators, the hanging effect: if the growth per step of an integrator is less than its resolution, the integrator does not change its value at all. This occurs for example in a PID controller with a time constant of about 10000 step times or more: a step time of 50 µs, a proper value for electrical control, and an integration time of 500 ms, a proper value for smoothing a DC current signal. If the integrator has a value of about 4.0f and the input value is 0.001f, the input multiplied by the timing factor is 0.0000001f. This value is too small; the integrator does not integrate it. Float numbers have only 7..8 valid decimal digits. An int32, in contrast, has more than 9 digits. If the integer is properly scaled, these additional 8 bits or 2..3 digits are essential for such applications.
See also discussion in https://www.microchip.com/forums/m487270.aspx
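The hanging effect can be demonstrated with a small sketch. The values 4.0f, 0.0000001f and the 2^24 scaling follow the example above; the function names are made up for illustration:

```c
#include <stdint.h>

/*Adds the same tiny increment 'steps' times to a float accumulator.
 *If dq is below half an ULP of q, every addition rounds back to q.*/
static float integrate_float(float q, float dq, int steps) {
  for (int i = 0; i < steps; ++i) { q += dq; }
  return q;
}

/*The same integration with an int32 accumulator; the scaling of q and dq
 *is chosen by the caller, for example 1.0 =^ 0x1000000.*/
static int32_t integrate_i32(int32_t q, int32_t dq, int steps) {
  for (int i = 0; i < steps; ++i) { q += dq; }
  return q;
}
```

With q = 4.0f and dq = 0.0000001f the float accumulator never changes, because 1e-7 is below half an ULP of 4.0f (about 2.4e-7). With a scaling of 1.0 =^ 0x1000000 the same increment is the integer 2 and accumulates correctly.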
1.2. Disadvantages of fixed point
The advantage of floating point is obvious: there is no necessity to think about scaling, and all multiplications run well. Scale from integer to float (measurement values), calculate in float, scale back to integer for the actuator, ready: a simple, obvious and manageable software.
Using fixed point, the scaling itself should not be a problem. Adding values with different scaling is also not a problem, but a manually programmed adaption of the scaling is necessary (which floating point addition does automatically). Each multiplication, however, is a problem: multiplying two 16 bit values needs a 32 bit result, which must then be rescaled.
A 16 x 16 ⇒ 32 bit multiplication in assembler using the best machine instructions is simple. But this is not directly supported by C/++, which also gives no proper support for fixed point overflow. The programmer of such an algorithm should know the machine instructions while thinking in C/++ statements.
1.3. Calculation time comparison
If a processor contains floating point hardware arithmetic, it is optimized. For example a cosf()
(float cosine) needs only 50 ns on a TMS320F28379D processor (and others of this family) with 200 MHz clock. In contrast, the fixed point cosine variant with second order (quadratic) interpolation and 14 bit accuracy (which is enough) needs 400 ns, because it needs about 40 machine instructions. The cosf in float is built into the hardware and hence fast, about 10 clock cycles. If the interpolation were also supported in hardware, it could be equally fast. But unfortunately it is not.
But comparing only single operations (cosf
50 ns vs. cos16_emC
400 ns) is not the whole truth. A specific algorithm is always assembled from many machine code instructions, not a single standard operation. Though a cosf
is faster in float arithmetic than in int16, the whole algorithm may need more time in floating arithmetic. This depends on the machine instructions, the memory access possibilities (32 bit float numbers versus 16 bit integers), the register architecture of the processor, and some other things.
Of course, if floating point hardware arithmetic is not present but fixed point multiplication is, the fixed point arithmetic is faster. This is true for some cheap and power-reduced processors. Often a fast assembler-written floating point library is available, so a floating point operation does not take excessively long. Measured on the Infineon TLE9879 processor with 40 MHz clock: 1.5 µs for a cos16_emC(…)
16 bit fixed point operation, and 4 µs for the adequate cosf(..)
float operation. Using fixed point is thus a question of detail. If fast cycle times are necessary in such an actuator controller, fixed point may be necessary, but also sufficient.
Some processors do not have fixed point multiplication, for example the MSP430 from Texas Instruments. Such processors are anyway not proper for fast step times (µs). They have libraries for fixed point multiplication as well as floating point operations, written in assembler. For such processors the decision between fixed and floating point numerics is a question of the available memory (flash and RAM): the libraries need space.
2. Scaling of fixed point numbers
Generally there are two systems for scaling values:

a) Using nominal values: This is the older approach and comes from analog technique. The values are given as 100% or 1.0 for the designed value range, where an overdrive above the 100%, for example to 120%, should be possible. The nominal value of a voltage may be 230 V, mapped as 1.0. Overvoltage up to 260 V is possible; this is 113%.

b) Using physical values: This may be exactly the SI units, but for example kA, MW or kV are also possible. These are better readable, also in debugging situations, if the software deals more with megawatts than watts.
Approach b) is generally better for floating point, but it can also be adapted for fixed point. The scaling should generally be determined by the user, or the application. But some things should be determined by the system.
For a system's determination, approach a) is proper. For example, to calculate with sin and cos, the range is -1.0 .. 1.0. It is a general question how to map this value range to fixed point numbers, because the fixed point sin and cos should work properly with it.
2.1. Nominal value for 16 bit fixed point
It is purposeful to use a value of 30000 to map the 1.0. Then an overdrive up to 1.09 is possible. It is not purposeful to define the range of the int16 exactly from -1..+1, because the -1 is mapped to -32768 (0x8000), but the +1.0 cannot be mapped, at most 32767. The fixed point range is not symmetrical.
In a PLC ("Programmable Logic Controller") such as the Simatic S7 a nominal value of 27648 is used, see for example http://www.infoplc.net/files/descargas/siemens/infoPLC_net_Siemenes_S7_300_Escalado_Analogicas.pdf. There are many hits on an internet search engine. This value is:
0x6c00 = 27648 = 3^{3} * 2^{10} = 27 * 1024
This value is divisible by 12, which may sometimes be a good property. It allows a possible overdrive up to 118% or ~1.18.
Another proper value with good divisibility is 27720. It is near the PLC standard value, also with possible overdrive to ~1.18, but it is divisible by more numbers:
0x6c48 = 27720 = 2 * 2 * 2 * 3 * 3 * 5 * 7 * 11
able to divide by 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 18, 20, 22, 24, 28, 30, 33, 35, 36, 40, 42, 44, 45, 60
The number 25200 as nominal value for a 16 bit integer allows an overdrive up to 130% or 1.3, which is usually enough for technical systems. It is divisible by 100, 90, 80, 75, 70, 60, 50, 45, 40, 35, 30, 28, 25, 24, 21, 20, 18, 16, 15, 14, 12, 10, 9, 8, 7, 6, 5, 4, 3, 2 without fractional part.
0x6270 = 25200 = 2 * 2 * 2 * 2 * 3 * 3 * 5 * 5 * 7
This number should be the best for scaling 16 bit fixed point numbers with a nominal value of 1.0: overdrive up to 130% or 1.3, a resolution of nearly exactly 0.004%, and divisibility by most of the important factors. It is a universal number and also imaginable while debugging.
Hence this number is defined in emC/Base/Math_emC.h
as
#define NOM1_i16_Ctrl_emC 0x6270 //= 25200
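As an illustration, a measured value can be mapped to this scaling as follows. The 1000 A nominal range is the hypothetical example from chapter 1.1, and the function name is made up here, not an emC definition:

```c
#include <stdint.h>

#define NOM1_i16_Ctrl_emC 0x6270  //= 25200, as in emC/Base/Math_emC.h

/*Maps a current in ampere to the 16 bit nominal scaling, 1000 A =^ 1.0 =^ 25200.
 *Assumption: the value is non-negative; +0.5f rounds to the nearest integer.
 *The full overdrive range up to 1300 A fits, because 1.3 * 25200 = 32760 <= 32767.*/
static int16_t currentToNom16(float current_A) {
  return (int16_t)(current_A * (NOM1_i16_Ctrl_emC / 1000.0f) + 0.5f);
}
```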
The number 23170 = 0x5A82
is approximately sqrt(2) * 0x4000
(exactly 23170.475). If a value is normalized with this, the square of the value is normalized to 0x4000 and can be used for further calculation of a magnitude (see #sqrt and #rsqrt). An overdrive up to 140% is possible. It is defined as
#define NOMqu_i16_Ctrl_emC 0x5a82 //23170
But because multiplication and bit shifting are often important, the scaling should be oriented to 0x4000,
which allows an overdrive up to 1.999; its resolution of 1/16384 is sufficient, better than 0.1 per mille.
#define NOM2_i16_Ctrl_emC 0x4000
2.2. Nominal value for 32 bit fixed point
It is proper to use the same number as nominal value as for int16, only expanded to 32 bit. This is
1651507200 = 0x62700000 = 25200 << 16 = 2^{20} * 3 * 3 * 5 * 5 * 7
The overdrive is about 130%; the resolution is fine enough.
The other variant is scaling with 0x40000000 for 1.0.
2.3. Problem of multiplication and scaling
The nominal values in the chapters above are properly usable for add and sub operations: the nominal range is not changed by these operations.
But multiplication does not work so simply; another way of thinking is necessary:
If 1.0 is scaled as 25200 in int16, the multiplication 1.0 * 1.0 should also result in 1.0, it means 25200 in this scaling. But this is not so.
For factors, a scaling of exactly 2^{n} is necessary. After the multiplication the result is shifted back. See the example:
0x4000 * 0x4000 => 0x10000000   //example on scaling 1.0 =^ 0x4000
0x8000 * 0x8000 => 0x40000000
0xffff * 0xffff => 0xfffe0001   //scaling 0..0.9999 as a float-like mantissa
One can verify this with an ordinary hex calculator on a PC; it is a simple unsigned integer multiplication. But see the next chapter.
If you multiply a nominally scaled value with a factor which is guaranteed < 1.0, you can set the decimal point left of all bits. 0xffff is then mapped to 0.99998 and 0x8000 is exactly mapped to 0.5. Then you can multiply without additional shifting. The result has the same scaling, but with double bit width:
(0x6270 = 25200) * 0xffff => 0x626f9d90
If you use only the upper 16 bit as result, you get 0x626f = 25199, which corresponds to 0.99998 if the NOM1_i16_Ctrl_emC = 0x6270 scaling is used.
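This scheme can be written as a small helper. It is a sketch, not an emC function; the name mulFrac16 is made up here:

```c
#include <stdint.h>

/*Multiplies a NOM1-scaled value (0x6270 =^ 1.0) by an unsigned fraction,
 *where 0xFFFF =^ 0.99998 and 0x8000 =^ 0.5. The upper 16 bit of the
 *32 bit product keep the NOM1 scaling without any extra shift.*/
static uint16_t mulFrac16(uint16_t valueNom1, uint16_t frac) {
  return (uint16_t)(((uint32_t)valueNom1 * frac) >> 16);
}
```

mulFrac16(0x6270, 0xFFFF) delivers 0x626F = 25199, the value from the text; mulFrac16(0x6270, 0x8000) delivers exactly half, 12600.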
You should never have the idea of scaling a factor with one of the nominal values that is not a power of 2. Why not? For example, you have a factor which may be > 1.0 but never > 1.18. Then you may use the nominal value 27648
as proposed for the Simatic S7, have a factor of 1.05, convert it to 29030 = (int16)(27648 * 1.05f),
multiply for example with 1.0
=^ 27648,
and get 802621440 = 0x2FD70800.
But what kind of number is that? You would have to multiply it with the reciprocal of the scaling to get a properly scaled value again. It is a non-constructive effort.
For such an example you should scale your factor as a fractional part of 2^{n}, with 0x8000
=^ 1.0,
seen as unsigned uint16.
Then multiply, using NOM1_i16_Ctrl_emC
only (!) for the first factor:
uint16 f2 = (uint16)(0x8000*1.05f + 0.5f);  //it is 0x8666
0x6270 * f2 => 0x33ADD8A0
You can use only the high part (to save effort), and shift it left by 1 bit:
(0x33ADD8A0) >> 16 => 0x33AD   //a cheap operation in machine code
0x33AD << 1 => 0x675A          //=^ ~1.05 in NOM1_i16_Ctrl_emC scaling
Test:
0x675A / 0x6270 =^ 26458 / 25200 = 1.04992
This is the proper result in the same scaling as the input value. The exactly possible result 0x675C
was not reached because of rounding effects.
Rule: You can use a special scaling for one of the factors, the physical value, but you should always use a power-of-2 mapping for the second factor.
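The 1.05 example above can be reproduced as a small sketch. mulFactor16 is an illustrative name, not an emC macro; the factor is scaled 0x8000 =^ 1.0, allowing factors up to ~2.0:

```c
#include <stdint.h>

/*Multiplies a NOM1-scaled value (0x6270 =^ 1.0) by a factor scaled 0x8000 =^ 1.0.
 *Only the high 16 bit of the 32 bit product are used, then shifted left
 *by 1 to compensate the 0x8000 (instead of 0x10000) scaling of the factor.*/
static int16_t mulFactor16(int16_t valueNom1, uint16_t factor) {
  int32_t p = (int32_t)valueNom1 * (int32_t)factor;
  return (int16_t)((int16_t)(p >> 16) << 1);
}
```

With the values from the text, the factor (uint16)(0x8000*1.05f + 0.5f) = 0x8666 applied to 0x6270 delivers 0x675A = 26458, i.e. ~1.04992 in NOM1 scaling.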
A multiplication can be done inside the processor as 16 * 16 bit with the result in 32 bit, or as 32 * 32 bit with the result in 64 bit. Such a result has no overflow. You can select the proper part and shift to get back to 16 bit or 32 bit, but it is possible to produce an erroneous value if the range of the result is violated. You should know your value range. You should avoid elaborate range tests; they need more calculation time than using the higher bit resolution.
Often, especially for integrators, it is recommended to use the 32 bit width for 16 bit input values for further calculation. The end result can then be shortened for the output value.
2.4. Angle representation in fixed point, also recommended for floating point environments
It is very simple. An angle between 0° and 360°, or better between -180° .. 0° .. +179.9999°, can be represented with the full integer range:
-180°       -PI            0x8000
 -90°       -PI/2          0xC000
   0°        0             0x0000
 +90°       +PI/2          0x4000
+179.999°   +0.99999*PI    0x7FFF
There is a very important advantage: the angle is fully defined on a 360° circle. The difference (181° - 179°) is, as expected, simply 2°, although 181° is stored as its wrapped value, because integer arithmetic wraps around without overflow handling. The values are closed between 0x7fff and 0x8000. For example, if a T1 FBlock (smoothing block) is necessary and the angle varies in a range around 180°, there is no problem. For a float representation this is a notable problem: overflows beyond +PI and below -PI must always be handled.
Hence it is better to use this angle representation also in floating point environments, with a simple conversion:
#define angle16_degree_emC(GR) (int16)(((GR)/90.0f) * 0x4000)
#define angle16_rad_emC(RAD) (int16)((RAD) * (32768.0f/PI_float_emC))
#define radf_angle16_emC(ANGLE16) ((ANGLE16) * (PI_float_emC / 0x8000))
#define gradf_angle16_emC(ANGLE16) ((ANGLE16) * (180.0f / 0x8000))
Calculation of angle values, especially differences, should be done using integer arithmetic. Before using the angle value as float, the simple multiplication above should be done.
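The wrap-around property can be shown with a short sketch. diffAngle16 is an illustrative helper, not an emC function; the cast relies on the usual two's complement wrap of int16:

```c
#include <stdint.h>

/*Difference of two angles in the int16 representation (0x8000 =^ -180°).
 *Plain integer subtraction wraps modulo 2^16 and hence gives the correct
 *signed difference across the +/-180° boundary (two's complement assumed).*/
static int16_t diffAngle16(int16_t a, int16_t b) {
  return (int16_t)(uint16_t)(a - b);
}
```

For example 0x8000 (-180°) minus 0x7FFF (+179.995°) gives 1, i.e. +0.005°, as expected on the closed circle; 45° minus -45° gives 0x4000 = 90°.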
3. Saturation arithmetic
The standard machine instructions of most processors calculate integer arithmetic without overflow handling, but set flags on overflow. This behavior was already known from the Z80 processor (Zilog), with the CY flag for unsigned overflow and the OV flag for signed overflow:
ADD 0x7FFE, 0x0002 => 0x8000, sets the OV flag
ADD 0x7FFE, 0xFFFE => 0x7FFC, sets the CY flag, does not set the OV flag
But in the C language this problem went unnoticed; using the flags was never a topic of the C language. With the flags it is simple to correct an overflowed result, in assembly language.
First the ARM processor technology and then also some other processors introduced so-called saturation operations. These operations deliver a non-overflowed, clamped result instead of setting an overflow flag. This result is properly usable: technical values are limited. A faucet cannot be opened more than fully open.
See for example

https://www.analog.com/media/en/dspdocumentation/processormanuals/SC58x2158xprm.pdf, search for "saturation".

https://developer.arm.com/documentation/dui0553/b: Cortex™-M4 Devices Generic User Guide, chapter 3.7 Saturating Instructions
These operations differ somewhat between processors but are similar. In the C/++ language they are still not part of the standard; some different approaches are in use.
3.1. Macros for optimizing saturation arithmetic
In emC/Base/Math_emC.h
some macros are defined which can be adapted to proper assembly instructions for these saturating operations, especially for ARM. The standard implementation is defined in C language, proper but not fully optimized for all processors.
adds16sat_emC(R, A, B)   16 bit signed saturating add
addu16sat_emC(R, A, B)   16 bit unsigned saturating add
subu16sat_emC(R, A, B)   16 bit unsigned saturating sub
adds32sat_emC(R, A, B)   32 bit signed saturating add
addu32sat_emC(R, A, B)   32 bit unsigned saturating add
subu32sat_emC(R, A, B)   32 bit unsigned saturating sub
The signed subtraction can anyway be performed by using the negated value with the signed add, though the ARM also has a signed saturating subtract.
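The idea of the C fallback of such a macro is to calculate on the widened result and clamp. A sketch, written as a function instead of the emC macro form:

```c
#include <stdint.h>

/*16 bit signed saturating add: calculate in 32 bit (no overflow possible
 *there), then clamp to the int16 range instead of wrapping around.*/
static int16_t adds16sat(int16_t a, int16_t b) {
  int32_t r = (int32_t)a + (int32_t)b;
  if (r > 32767) { r = 32767; }    //clamp to int16 maximum
  if (r < -32768) { r = -32768; }  //clamp to int16 minimum
  return (int16_t)r;
}
```

For example 0x7FFE + 2 delivers 0x7FFF instead of the wrapped 0x8000 from the Z80 example above.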
Some more instructions, sometimes necessary, mirror some special assembly instructions of the ARM, which may exist similarly in other processors:
add2s16sat_emC(R1, R2, A1, A2, B1, B2)   16 bit signed add with two values
add2u16sat_emC(R1, R2, A1, A2, B1, B2)   16 bit unsigned add with two values
sub2u16sat_emC(R1, R2, A1, A2, B1, B2)   16 bit unsigned sub with two values
These instructions, which handle two operations with one assembly instruction, may be necessary especially for 16 bit integer operations. The 8 bit saturation operations are not (yet) supported by macros. These operations are familiar from color calculations, which are not the focus of embedded algorithms; they can be created if necessary.
These macros are defined in C language in a commonly proper way in emC/Base/Math_emC.h.
For a specific processor these macros can be defined specially, using the __asm(…)
definition, in the compl_adaption.h.
Because #ifdef adds16sat_emC
etc. is tested, such specific definitions of the macros in compl_adaption.h
are used instead.
Using these macros enables optimized usage of machine instructions though the program is written in C/++.
3.2. Macros to detect a saturation
On ARM processors the saturation operations set the Q
flag if a saturation occurs, and leave it unchanged otherwise. The usage should follow this pattern:

Reset Q flag

Do some operations, which may cause saturation

Test Q flag to detect whether the whole algorithm has saturated anywhere.
On saturation a warning or similar can be noted. The saturated result remains usable; this is the approach of saturating arithmetic. Evaluating the Q flag gives only additional information.
But this additional information may be necessary. Hence the macros for saturation arithmetic in C language take it into account too.
It is necessary to write the following macro in a statement block which uses the saturating macros:
DEFsatCheck_emC
To check whether saturation has occurred, it can be tested:
if(satCheck_emC()) {
  .....
  clearSatCheck_emC();  //to clear the flag
}
The usage of these flags is optional. If these macros are not used, the effort for the boolean variable in DEFsatCheck_emC
is removed by the optimizing compilation.
If these macros are adapted to a specific machine instruction set (in compl_adaption.h),
for the ARM processor they should check and clear the Q
flag.
4. Signed or unsigned, 16 or 32 bit multiplication
Generally the multiplication of two 16 bit values results in 32 bit, and the multiplication of two 32 bit values results in 64 bit. This is given by bit-mathematical correlations. The multiplier hardware in some processors, also in cheap 16 bit processors (example MSP430, Texas Instruments), does so, but often a 64 bit result is stored either as the 32 bit low part or the 32 bit high part. The second is especially convenient if left-aligned mantissas are multiplied (similar to a floating point mantissa).
But programming in the C/++ language does not regard these relationships: as for add/sub arithmetic, the width of the operands and the result are the same. Maybe some given libraries for C or C++ do it better, but special libraries, maybe written in assembler for a special processor, are not a contribution to an "embedded multi platform" C/++ programming style; they are not commonly usable.
For example, to support a 16 * 16 ⇒ 32 bit multiplication on a Texas Instruments processor, it should be written (originally copied from https://www.ti.com/lit/an/spra683/spra683.pdf):
INT32 res = (INT32)(INT16)src1 * (INT32)(INT16)src2;
But this is a special writing style for the Texas Instruments compiler; it may not be a hint for all processors.
4.1. Macros for multiplication, simply readable and adaptable
There are also some "multiply and add" instructions, and some instructions to detect overflow. Reading some processor manuals, it seems adequate to program such parts in assembler. The link https://www.quora.com/How-much-could-we-optimize-a-program-by-writing-it-in-assembly discusses some good and bad reasons to program in assembler; see there the contribution from Hanno Behrens to "Do people still write assembly language?".
From the position of producers of tool or processor support, it is appropriate to deliver libraries for expensive mathematical and control algorithms (including PID control with some precautions against integral windup and such things). The application programmer can use them, with no need for own mathematics. But this approach binds the user to the once selected hardware platform; the programs are not portable. There is no standard (for C/++) to unify such libraries.
The C/++ language allows constructs to implement specific assembly instructions; that is an advantage of the C language. In this way specific macros or inline functions can be defined as an interface, which can be simply implemented by inline assembly statements or macros with specific castings (written one time) for any processor, while a standard implementation in C code exists, which may be fast enough for first usage.
The emC concept is predestined to support such an interface, because it is "embedded multi platform C/++". Hence it does so.
The following operations (which may be specifically implemented as macros or inline functions in the compl_adaption.h)
are defined and pre-implemented in C (emC/Base/types_def_common.h):
void muls16_emC(int32 r, int16 a, int16 b);        //16 bit both signed, result 32 bit
void mulu16_emC(uint32 r, uint16 a, uint16 b);     //16 bit both unsigned, result 32 bit
void mul32lo_emC(int32 r, int32 a, int32 b);       //32 * 32 bit, result lo 32 bit
void muls32hi_emC(int32 r, int32 a, uint32 b);     //32 * 32 bit signed, result hi 32 bit
void mulu32hi_emC(uint32 r, uint32 a, uint32 b);   //32 * 32 bit unsigned, result hi 32 bit
void muls32_64_emC(int64 r, int32 a, int32 b);     //32 * 32 to 64 bit signed
void mulu32_64_emC(uint64 r, uint32 a, uint32 b);  //32 * 32 to 64 bit unsigned
Additionally there are some more instructions which support the "multiply and add" approach.
void muls16add32_emC(int32 r, int16 a, int16 b);
void mulu16add32_emC(uint32 r, uint16 a, uint16 b);
void mul32addlo_emC(int32 r, int32 a, int32 b);
void muls32addhi_emC(int32 r, int32 a, uint32 b);
void mulu32addhi_emC(uint32 r, uint32 a, uint32 b);
void muls32add64_emC(int64 r, int32 a, uint32 b);
void mulu32add64_emC(uint64 r, uint32 a, uint32 b);
These operations cannot be part of an expression; they have the form of statements. For that reason it is sometimes simpler to build proper macros or assembly expressions.
Why is a signed or unsigned distinction not necessary for mul32lo_emC?
Because:
The lo part of a multiplication is the same independent of the sign of the inputs.
This applies similarly for a mul*16_emC.
The essential thing is: the inputs could be exactly expanded to their 32 bit form, then multiplied 32 * 32 bit, using the lower result. But that is more effort. The multiplication 16 * 16 bit to 32 bit needs less time and hardware resources; only the sign should be automatically correctly expanded.
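The statement about the lo part can be checked with a small sketch comparing the low 32 bit of a signed and an unsigned 32 * 32 bit multiplication (the function name is made up for illustration):

```c
#include <stdint.h>

/*Returns 1 if the low 32 bit of a*b are the same for signed and unsigned
 *interpretation of the operands. This identity holds for all inputs,
 *because the low bits of a product depend only on the low bits modulo 2^32.*/
static int loPartSignIndependent(int32_t a, int32_t b) {
  uint32_t loSigned = (uint32_t)((int64_t)a * (int64_t)b);
  uint32_t loUnsigned = (uint32_t)((uint64_t)(uint32_t)a * (uint32_t)b);
  return loSigned == loUnsigned;
}
```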
That’s why usual embedded processors have often machine instructions for multiplications:

16 bit signed, result 32 bit

16 bit unsigned, result 32 bit

32 bit, presenting the lower part

32 bit signed and unsigned, presenting the higher part
These are the same as the mul*_emC
operations above. They are the usually necessary ones.
4.2. How to use 32 bit multiplication with the lower 32 bit result, the overflow problem
A multiplication 32 * 32 bit to the lower 32 bit part is often offered as a machine code instruction.
Generally the 32 * 32 bit multiplication using the lower 32 bit result may have an overflow problem; then the result is not usable. But:
If the sum of the relevant input bits does not exceed 32, the result is correct.
In contrast to the 16 * 16 bit multiplication, it is possible for example to use 24 bit resolution of a signed number for one factor, multiplied with a factor which needs only 8 bit. That can be a gain from 0..~16 with a resolution of 1/16 (for the user: ~0.1). It is fine enough for some applications. The gain is scaled 1.0 =^ 0x10.
Example: multiply -1.5 * 1.0625. The scaling of the first factor is 1.0 =^ 0x00'400000,
which allows a 24 bit range of -2 .. +1.999; before the multiplication a limitation may be done to prevent overflow. -1.5 =^ 0xFF'A00000
0xFFA00000 * 0x00000011 => 0x10'F9A00000
It is important that both 32 bit factors are exactly expanded with the sign. It means the 24 bit value 0xA00000
(negative) is expanded to 0xFFA00000
and the 0x11
(positive) is expanded to 0x00000011.
The result is correct if the signs of the factors are correctly expanded; a distinction between signed and unsigned multiplication is not necessary.
The resulting value is 0xF9A00000;
the higher part 0x00000010
of the theoretical 64 bit result is not built. The result should be shifted 4 bits to the right, because the second factor has 4 binary digits in the fractional part:
0xF9A00000 >> 4 => 0xFF'9A0000
This is near the input value, only multiplied by 1.0625, presenting
0x9A0000 / 0x400000 =^ -0x660000 / 0x400000 =^ -1.59375
Note: The negative 24 bit number 0x9A0000
is 0x660000
negated; usual 32 bit arithmetic is used, 0xFF9A0000
⇒ negated 0x00660000.
4.3. Adaption of the macros of fixed point multiplication to the machine code
If you want to multiply 16 * 16 bit signed or unsigned to 32 bit, you should expand the input values exactly as necessary for the sign, and simply multiply. For better understanding, the numbers here have their decimal point after 4 bits; the input range is -8.0 .. 7.9999. The decimal point of the result is then 8 bits from the left.
0xc000 * 0xc000 => 0xffffc000 * 0xffffc000 => 0x10000000   //-4 * -4 => +16.0
0xc000 * 0x4000 => 0xffffc000 * 0x00004000 => 0xF0000000   //-4 *  4 => -16.0
In C it is:
#ifndef muls16_emC
  #define muls16_emC(R, A, B) \
  { R = ((int32)(int16)((A) & 0xffff) * (int32)(int16)((B) & 0xffff)); }
#endif
#ifndef mulu16_emC
  #define mulu16_emC(R, A, B) { R = ((uint32)(uint16)(A) * (uint32)(uint16)(B)); }
#endif
These are the standard definitions of both macros. The usage is:
int16 a, b; ...
int32 result;
muls16_emC(result, a, b);
uint16 p, q; ...
uint32 uresult;
mulu16_emC(uresult, p, q);
That delivers the correct results in any case. The compiler can recognize that the inputs have only 16 bit because of the mask and type casting, and can be expected to use its best machine code, which may be a 16 * 16 ⇒ 32 bit multiplication.
It may be possible, depending on the compiler properties, that the following term is sufficient:
int16 a,b; ... int32 result = a * b;
But that is not sure in every case; hence this writing style is not portable. It may be possible that the result has 16 bit and is then expanded to 32 bit, or that it works exactly. Unfortunately the C standard guarantees nothing here.
For some compilers an __asm(…)
statement can be used to select the desired machine code. This can be written as:
#define muls16_emC(R, A, B) \
{ R = (int16)((B) & 0xffff); __asm("MUL %0, %1, %0": \
  "+r" (R): "r" ((int32)(int16)((A) & 0xffff))); }
#define mulu16_emC(R, A, B) \
{ R = (uint16)((B) & 0xffff); __asm("MUL %0, %1, %0": \
  "+r" (R): "r" ((uint16)((A) & 0xffff))); }
This is an example for ARM processors. The ARM M3 has the 32 * 32 ⇒ 32 lo bit multiplication (see the chapter above), but no dedicated 16 * 16 ⇒ 32 bit multiplication instruction, because the ARM machine always works with 32 bit width. The MUL
instruction works with two registers; one operand should be the same as the destination. Hence there is no possibility for further optimization: the compiler produces similar results when using the standard macros.
It depends on the type of processor. The Texas Instruments TMS320C2xx processor has the 16 * 16 ⇒ 32 bit MPY
instruction, but the C2000 compiler does not support __asm(…)
instructions with self-defined registers; it supports only simple static machine code. But this compiler optimizes well even without the asm
macros.
For writing the __asm
macro for gcc compiling, and also for ARM compiling (AC6), see:
4.4. Example algorithm of a smoothing block with 16 bit input and output but 32 bit state
The data are defined in emC/Ctrl/T1_Ctrl_emC.h
as:
typedef struct T1i_Ctrl_emC_T {

  /**This value is only stored while param call. It is not used. */
  float Ts;

  /**The difference (x-q) for usage as smoothed differentiator.
   * dxhi is the representative int16 part regarded to the input. */
  Endianess32_16_emC_s dx;

  /**The output value and state variable.
   * qhi is the representative int16 part regarded to the input. */
  Endianess32_16_emC_s q;

  /**Factor to multiply the difference (x-q) for one step.
   * This factor is calculated as 65536 * (1.0f - expf(-Tstep / Ts))
   * to expand the 16 bit x value to 32 bit for q.
   * A value Ts = 0, it means without smoothing, results in 0xffff because only 16 bits are available.
   * The largest Ts is 65000 * Tstep, which results in 1 for this value.
   * Larger Ts does not work. */
  Endianess32_16_emC_s fTs;

} T1i_Ctrl_emC_s;
The 32 bit values are also accessible as 16 bit parts by building a union. For that a struct Endianess32_16_emC_s
is used which contains only a union to access the 32 bit value and its 16 bit hi and lo parts. This struct
is defined depending on the endianness of the processor.
The T1 factor is built with:
bool param_T1i_Ctrl_emC(T1i_Ctrl_emC_s* thiz, float Ts_param, float Tstep) {
  thiz->Ts = Ts_param;
  float fTs = (Ts_param <= 0 ? 1.0f : 1.0f - expf(-Tstep / Ts_param));
  fTs *= 0x100000000L;
  thiz->fTs.v32 = fTs >= (float)(0xffffffff) ? 0xffffffff : (int32)(fTs + 0.5f);
  return true;
}
To convert the factor, floating point arithmetic is used. In a cheap 16 bit processor it is calculated by software and needs a longer time, but the factors are usually calculated only at startup time or in a longer cycle. It is possible to give factors also without conversion, or via conversion over a table, to speed it up. The factor has the decimal point on the left and up to 32 fractional bits.
The step calculation, usually called in a fast cycle, is simple. It uses a 16 * 16 ⇒ 32 bit multiplication, which is usually available fast also in cheap processors. The 32 bit result is used for the integration. The result value uses the higher 16 bit part of this integrator.
static inline int16 step_T1i_Ctrl_emC(T1i_Ctrl_emC_s* thiz, int16 x) {
  thiz->dx.v32 = (uint32)(thiz->fTs.v16.hi) * (x - thiz->q.v16.hi);
  thiz->q.v32 += thiz->dx.v32;
  return thiz->q.v16.hi;  //hi part 16 bit
}
The possible higher accuracy of (x - thiz->q)
is not used in this algorithm, but it may be necessary for longer smoothing times. The algorithm above limits the smoothing time to about 65000 * Tstep, because the used high part of fTs
is then 0x0001; for greater times it would be 0x0000 and nothing would occur.
If longer smoothing times are necessary, a 32 * 32 ⇒ 32 bit multiplication is required, where the higher part of the 64 bit result is used. A further improvement would be a 64 bit wide integrator, but this is not realized here; it is a question of calculation time effort. The better step routine for longer smoothing times can be called in the application:
static inline int16 step32_T1i_Ctrl_emC(T1i_Ctrl_emC_s* thiz, int16 x) {
  thiz->dx.v32 = (int32)(((uint64)(thiz->fTs.v32) * ((int32)(x<<16) - thiz->q.v32)) >> 32);
  thiz->q.v32 += thiz->dx.v32;
  return thiz->q.v16.hi;  //hi part 16 bit
}
The right shift (…)>>32
takes the upper 32 bit result register from the multiplication and ignores the lower multiplication result. That is adequate for this algorithm.
The dx
part can be used as a differentiator with smoothing, simply accessible after this calculation with:
static inline int16 dx_T1i_Ctrl_emC(T1i_Ctrl_emC_s* thiz, int16 x) {
  return thiz->dx.dx16.dxhi;
}
The externally used values are all 16 bit, for a 16 bit controlling algorithm on a 16 bit controller. But the internal state of the smoothing block is stored as 32 bit. This matters both for the machine execution of the multiplication and (!) for the resolution of the smoothing: you can use a large smoothing time and get exact results without the hanging effect.
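To see the principle outside the emC headers, here is a self-contained simplification of the 16 bit step routine. The Endianess32_16_emC_s access is replaced by explicit shifts; this is an illustration, not the emC implementation:

```c
#include <stdint.h>

typedef struct T1iSimple_T {
  int32_t q;       //32 bit state; its hi 16 bit are the output value
  uint16_t fTsHi;  //hi part of the smoothing factor, 0x10000 =^ 1.0
} T1iSimple;

/*One smoothing step: 16*16 => 32 bit multiplication, 32 bit integration,
 *the low 16 bit of q keep the residue that a pure 16 bit state would lose.*/
static int16_t stepT1iSimple(T1iSimple* thiz, int16_t x) {
  int16_t qHi = (int16_t)(thiz->q >> 16);
  thiz->q += (int32_t)thiz->fTsHi * (int32_t)(x - qHi);
  return (int16_t)(thiz->q >> 16);
}

/*Helper for testing: run n steps from a zero state with a constant input.*/
static int16_t runT1iSimple(uint16_t fTsHi, int16_t x, int n) {
  T1iSimple t = { 0, fTsHi };
  int16_t y = 0;
  for (int i = 0; i < n; ++i) { y = stepT1iSimple(&t, x); }
  return y;
}
```

With the factor 0x8000 (0.5), each step moves halfway to the input: for x = 0x4000 the output is 0x2000, 0x3000, 0x3800, ..., converging to 0x4000 without a hanging effect.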
If floating point arithmetic is used, the algorithm is simpler to write and understand, but you get the hanging effect already for lesser smoothing times (a disadvantage of that implementation), and you always need more calculation time, even if floating point hardware is present.
5. Trigonometric and sqrt routines for fix point arithmetic
The sin, cos, sqrt etc. are part of the standard C/C++ libraries for floating point, in single and double precision, but not for fix point.
The other question is: calculation time.
5.1. Linear and quadratic approximation
The red curve shows the real function. The blue lines are the tangents. Near the supporting point itself the linear approximation is as accurate as possible. But between the points a greater deviation occurs. This can be remedied by a second order (quadratic) approximation, not shown here. As tested, the quadratic approximation delivers errors less than or equal to 4 for 16 bit numbers with only 16 supporting points, a well usable result.
dx := (x - x_{p});
y := y_{p}(x_{p}) + g_{p}(x_{p}) * dx + g2_{p}(x_{p}) * dx^{2};
Where g_{p} is the tangent (the exact derivative of the curve) or the quotient

g_{p} = (y_{p+1} - y_{p-1}) / (x_{p+1} - x_{p-1});

and g2_{p} is the difference of the gradients to the left and the right point:

g2_{p} = ( (y_{p+1} - y_{p}) / (x_{p+1} - x_{p}) ) - ( (y_{p} - y_{p-1}) / (x_{p} - x_{p-1}) );
If the step widths between the points in x are equidistant with x_{pp}, it is simpler:

g_{p} = (y_{p+1} - y_{p-1}) / ( 2*x_{pp});

g2_{p} = ( (y_{p+1} - y_{p}) - (y_{p} - y_{p-1}) ) / x_{pp};
This is the same as shown for example in https://www.astro.uni-jena.de/Teaching/Praktikum/pra2002/node288.html (visited on 2021-04-13); only the term for the quadratic part is shown there more 'mathematically', but it is the same: `( (y_{p+1} - y_{p}) - (y_{p} - y_{p-1}) ) == (y_{p+1} - 2 * y_{p} + y_{p-1})`. The left form prevents overflow in calculation with fix point arithmetic.
Simple linear approximation, but with centered correction of the deviation
But the algorithm for a linear approximation is very simple:

y := y_{p}(x_{p}) + g_{p}(x_{p}) * (x - x_{p})

x_{p} is the x value where the nearest point is found. g_{p} and y_{p} are read from the table. dx = x - x_{p} is the distance left or right of the point.
The green lines are slightly shifted. The error at the supporting points is approximately equal to the error in the middle; the error is halved. But this is not the most important effect. More important may be that an integration does not sum up the deviations in only one direction. It can be seen on cos values: Its integration should deliver sin values, and the area over the range -Pi .. Pi should be 0. The calculation is the same, but the values are corrected in the table.
For the C/C++ implementation the tables contain values given immediately as hexadecimal numbers. But these values are calculated by a Java program, using the double precision functions, with the named corrections. See org.vishia.math.test.CreateTables_fix16Operations (https://github.com/JzHartmut/testJava_vishiaBase).
In C language the core algorithm for the linear approximation is written as (emC/Base/Math_emC.c):
#define LINEARinterpol16_emC(X, TABLE, BITSEGM) \
  uint32 const* table = TABLE; \
  uint32 valTable = table[( X + (1 <<(BITSEGM-1) ) ) >>BITSEGM]; \
  int16 dx = ( X <<(16 - BITSEGM) ) & 0xffff; \
  int16 gain = (int16)(valTable & 0xffff); \
  muls16add32_emC(valTable, dx, gain); \
  int16 y = (int16)(valTable >>16);
It is a macro used in some functions, see below. The expanded macro is well suited for compilation. Using an inline function may have disadvantages, for example (16 - BITSEGM) could not be evaluated as a compile time constant. This macro is defined only locally in the compilation unit emC/Base/Math_emC.c, it is not commonly usable.
valTable: For faster access on 32 bit processors only one value with 32 bit is read from the table. The high part is the supporting point, the low part is the gain. The gain is extracted in a following line.
Index to the table: The x value in range 0x0000 .. 0xffff supplies the index to the supporting points. Depending on the size of a segment, given in BITSEGM as number of bits, the x value is shifted to the right. Adding (1 <<(BITSEGM-1)) before shifting selects the index of the right point for the right half of a segment. The index calculation is only a few operations in a 16 bit register. The value (1 <<(BITSEGM-1)) is always calculated as a constant by the compiler. Example: BITSEGM = 9 means a segment spans for example 0x3c00 .. 0x3e00, 9 bits. For a value 0x3d65 a 0x100 is added (1 << 8), and after the right shift of 0x3e65 the value 0x1f results as index into the table.
dx: The difference value inside the segment is taken from the x value by shifting to the left, shifting out the index bits. The dx is positive or negative. Example: For the value 0x3d65 a shift to the left by (16-9 ⇒ 7) bits is done and results in 0xB280. The operation & 0xffff is optimized away because the compiler may or should detect that it is only a 16 bit operation. Visual Studio may report a run time error, because it expands the numeric range to int32 and checks the following cast if & 0xffff were not written there, though x is an int16 and the operation should be performed as an int16 operation. Java behaves similarly, but for Java it is defined in the standard that all integer operations are executed with at least 32 bit.
muls16add32: The multiplication uses both 16 bit values. The usable result is located only in the high bits 31..16 of the multiplication result. Hence the addition of the whole table value, with the supporting point in bits 31..16, would add correctly in 16 bit with round-down if the low bits 15..0 of valTable were set to 0. Saving this masking operation gives a possible carry, an error of only one bit. This is not relevant; the calculation time saving is more relevant.
int y: The operation (int16)(valTable >>16) uses 32 bit half register operations or uses the 16 bit register immediately. As expected all compilers detect this situation and do not produce the >>16 operation, except on a 32 bit processor without access to half registers. Adequately the (int16)(valTable & 0xffff) is an optimized half register access without the mask with 0xffff, which would need loading a constant in machine code. Hence this operation is as fast as possible.
5.2. cos and sin, algorithm of linear interpolation in C language
sin and cos are adequate, only shifted by 90°. Hence the cos is programmed, and the sin is derived with:

#define sin16_emC(ANGLE) cos16_emC((ANGLE)-0x4000)
The cos is symmetric to the y axis and point-symmetric at 90°. The interpolation needs only to be executed between 0° and 90°:
int16 cos16_emC(int16 angle) {
  int16 x = angle;
  int16 sign = 0;
  if(angle < 0) {    // cos is symmetric to the y axis
    x = -x;          // Note: 0x8000 results in 0x8000
  }
  if(x & 0xc000) {
    x = 0x8000 - x;  // cos is point-symmetric at 90°
    sign = 0x8000;
  }
  //now x is in range 0x0000...0x4000
Because of the reduced x range only 32 and not 128 supporting points are needed. This reduces the Flash memory amount. But the preparation increases the calculation time. Hence it may be dismissed.

Simpler is using 64 supporting points and only building the absolute value. Results for positive and negative angles are exactly the same.
int16 cos16_emC ( int16 angle) {
  int16 x = angle;
  if(angle < 0) {    // cos is symmetric to the y axis
    x = -x;          // Note: 0x8000 results in 0x8000
  }
  //now x is in range 0x0000...0x8000
  //commented possibility, using interpolqu, extra call, more calctime
  //int16 val = interpolqu16(angle1, cosTableQu);
  //
  // access to left or right point
  LINEARinterpol16_emC(x, cosTable, 9)
  /* expanded:
  uint32 const* table = cosTable;
  uint32 valTable = table[( x + (1<<(9-1)) ) >>9];
  int16 dx = ( x <<(16 - 9) ) & 0xffff;
  int16 gain = (int16)(valTable & 0xffff);
  muls16add32_emC(valTable, dx, gain);
  int16 y = (int16)(valTable >>16);
  */
  return y;
}
This is the whole cos16_emC operation, with all comments for explanation (2021-04-07).
The cosTable looks like (shortened):

static const uint32 cosTable[] =
{ 0x7ffffffb  //  0 0
, 0x7fd3ffb2  //  1 200
, 0x7f5dff64  //  2 400
, 0x7e98ff15  //  3 600
  .....
, 0x0c8bf9c0  // 30 3c00
, 0x0648f9bb  // 31 3e00
, 0x0000f9b9  // 32 4000
, 0xf9b9f9bb  // 33 4200
, 0xf376f9c0  // 34 4400
  .....
, 0x80a4ff63  // 62 7c00
, 0x802dffb2  // 63 7e00
, 0x8000fffa  // 64 8000
};
The value for 0° is set to 0x7fff, which is 0.99997, because the value 1.0 cannot be presented. 90° is exactly 0, and 180° (angle 0x8000) is exactly 0x8000, which is -1.0. The error of the approximation is at max -8..8, tested. It is greater in the ranges around 0° and 180°, lesser (-1..1) in the range around 90°. Using the sinus with the same operation (only the angle is shifted) means a sin in the linear range around 0° has only an interpolation error of -1..1 related to 32768.
5.3. Calculation of the tables for supporting points in Java
The Java algorithm to get the supporting points and gains uses double arithmetic and rounding. It is written commonly for any mathematic function. See the snapshot of org.vishia.math.test.CreateTables_fix16Operations.java:
public class CreateTables_fix16Operations {

  /**The functional interface for the operation as lambda expression. */
  @FunctionalInterface interface MathFn {
    double fn(double x);
  }
Firstly an internal interface is defined for all the functions.
The common operation to create a table is:
/**Create a table for linear interpolation for any desired math operation
 * @param bitsegm Number of bits for one segment of linear interpolation (step width dx)
 * @param size Number of entries -1 in the table. The table gets (size+1) entries, should be 16, 32, 64
 * @param scalex Scaling for the x value, this result is mapped to 0x10000 (the whole 16 bit range)
 * @param scaley Scaling for the y value, this result of the operation is mapped to 0x8000.
 * @param fn The math function as lambda expression
 * @param name of the file and table as C const
 * @param fixpoints array of some points [ix][yvalue] which should match exactly
 * @return The table. */
public static int[] createTable(int bitsegm, int size, double scalex, double scaley
, MathFn fn, String name, int[][] fixpoints) {
  ....
}
It is a common operation to create tables for any mathematic function. The invocation for the cos is:
public static int[] createCosTable() {
  int[][] fixpoints = { {0, 0x7FFF}, {1, 0x7FD3}, {32, 0x0}, {64, 0x8000} };
  int[] table = createTable(9, 64, Math.PI, 1.0, (x) -> Math.cos(x), "cos", fixpoints);
  return table;
}
There are some manually given points, especially 0x7fff and 0x8000 for the first and last point. Otherwise the algorithm calculates a value of 0x7ffc and a gain of 0 for the first segment. The point {32, 0x0} is the value for 90°, which should match exactly. But the value is calculated to 0 also without this setting, because the cos is linear and point-symmetric in this range.
The segment size is 9 bit =^ 0x200. With 64 values the range 0x0..0x8000 is produced. Math.PI is the x scaling for this range. 1.0 is the y scaling related to 0x8000.
The expression (x) -> Math.cos(x) is a lambda expression in Java, a simple way to provide a function.

"cos" is the name of the file and of the table "cosTable". This routine generates all points, tests all points with printf output for manual inspection, and writes the file, currently to T:/<name>Table.c.
Java is a simpler programming language and hence proper for algorithms tested on the PC, as preparation for embedded software. The same written and tested in C++ is more complicated, and it is not necessary. The advantage of Java is: It has the same approaches for integer processing as C/C++.
5.4. arctan2
An important function to get the angle from a complex presentation is the arctan2 operation. Normally it divides both values and uses either the arctan or the arccot operation.
int16 arctan2(int16 im, int16 re) {
  if(re > im) { return arccotan(im/re); }
  else { return arctan(re/im); }
}
This is the proven concept; it uses stable mappings of the mathematic functions and prevents division by zero. But: It needs a division, which is not a cheap operation in fix point arithmetic.
There is another way to calculate:
Usually either the values of the { re, im } vector are normalized, or the magnitude is a point of interest anyway, and then the normalization is a cheaper, incidentally occurring operation.

Hence the arctan is defined for normalized values and uses an arcsin table. The arcsin does not need a decision, it is a continuous table.
int16 atan2nom16_emC ( int16_complex x ) {
  short re1 = x.re; short im1 = x.im;
  int quadr = 0;
  if(im1 < 0) { im1 = (short)-im1; quadr = 2; }
  if(re1 < 0) { re1 = (short)-re1; quadr |= 4; }
  int xasin;
  if(re1 < im1) { xasin = re1; quadr |= 1; }
  else { xasin = im1; }
  LINEARinterpol16_emC(xasin, asinTable, 10);
  switch(quadr) {
  case 0: return y;
  case 1: return 0x4000 - y;
  case 5: return 0x4000 + y;
  case 4: return 0x8000 - y;
  case 6: return 0x8000 + y;
  case 7: return 0xc000 - y;
  case 3: return 0xc000 + y;
  case 2: return -y;
  default: return 0;
  }
}
This is the whole algorithm, see emC/Base/Math_emC.c. Simple comparisons determine one of the 8 segments of the result. The input for the linear interpolation is always in the range 0x0000 .. ~0x2d80, which is a proper range for the approximation.
The normalization of the x input is 0x4000 for 1.0. Hence the lesser value of re and im part is never greater than ~0x4000/sqrt(2), which is approximately 0x2d41. But small errors of the normalization may force a slightly higher value. Therefore the asinTable goes up to 0x4000.
The calculation time of this solution for 16 bit fix point is approximately the same as the floating point atan2f(im, re) operation of the math.h standard library for C, measured on the TMS320F28379 processor from Texas Instruments, which has floating point support. In contrast, most int16 or int32 operations need about 150% of the calculation time compared with the floating point calculations on that processor.
How to normalize:

* Use either rsqrt16_emC(….) on the sum of the squares of the components, and multiply the components with that factor (reciprocal of the square root),

* or use Nom_int16_complex as step routine in a continuous control algorithm.
The normalization is often necessary anyway. To get the magnitude (not only the reciprocal of the magnitude) either a division can be used, or it is also delivered by the Nom_int16_complex.
5.5. sqrt
The square root is only defined for positive numbers. Hence the input range can be from 0 … <4.0, mapped to 0x0 … 0xffff. The sqrt is then in the range 0 .. <2.0, which can be mapped to 0x0 … 0x7fff. The result can then be used as a signed value.
int16 sqrt16_emC ( uint16 x);

is defined in that way. The value 0x4000, mapped to 1.0, results in 0x4000.
The sqrt function has a problem in the range near 0. The derivative of sqrt is 1 / (2 * sqrt(x)). This is infinite for exactly 0. The right image shows that the sqrt line (red) is perpendicular at the point 0. If the linear approximation is used with equal distances in x, it has a problem in the first segment: The linear connection does not deliver usable values.

But: Is an exact sqrt necessary for values near 0? Often values in the nominal range are the point of interest, i.e. between 0.5 … 2 of the nominal value. In that range the simple linear approximation of the sqrt is well usable. The test of the results of sqrt16_emC(…) uses values starting from 0x600 only, which is ~0.1. From this value on the deviation is < 9, which is < 0.1% related to the value 0x4000 =^ 1.0.
To get the sqrt in controlling systems there is another possibility, shown in the right image. The x value (the square) is delivered as the input of the controller. The feedback is the output multiplied with itself. This is compared with the input, and the controller controls it to equality (zero deviation). Hence the output of the controller is the sqrt, because the feedback is the square. See also chapter #Nom_int16_complex

This can be used whenever the input value x does not change over a wide range from one step to the next, as given in usual controlling systems. Also for the step response of the control loop only a short-time deviation occurs. It is fast.

The gain of this controller loop should depend on the max possible value of the output; with the multiplication, the gain in the loop should be < 1.0. This is true in the right image for input values less than 4.0.

The right image shows the controller output and the calculated sqrt. Both curves lie on top of each other. An additional offset of 0.02 is used only to make both curves distinguishable in the diagram.
5.6. rsqrt - reciprocal square root
The reciprocal or inverse square root can be used to normalize a value. The value should be multiplied with the reciprocal square root of the quadratic sum of the components:
rm := rsqrt(y.re * y.re + y.im * y.im);
y_{nom}(re, im) := y(re, im) * rm;
The magnitude itself is not calculated with this, but the division, which is more expensive on simple controllers, is avoided.
m := sqrt(y.re * y.re + y.im * y.im);
y_{nom}(re, im) := y(re, im) / m;
…needs two divisions, for the real and the imaginary part, instead of two cheaper multiplications.
But the value for 0 is infinite. It is mapped to the max possible value, 0x7fff =^ 1.9993. Values lesser than 0x14a0 =^ 0.322 deliver a result which is smaller than expected; using it for normalization then delivers too small values. The supporting point table has 33 entries. Values from 0x14a0 =^ 0.322 till 0x27a0 =^ 0.641 may have a deviation up to 42/16384. Values from 0.641 .. 1.9993 may have a deviation <= 10/16384, which is < 0.1%. It means for values near the nominal value 1.0 =^ 0x4000 it is properly usable.
int16 rsqrt16_emC ( int16 x);

works in the range 0 … 1.9993 mapped as 0x0 .. 0x7fff and delivers a value in the range 1.9993 … 0.5 mapped to 0x7fff .. 0x2000 as int16 result.
5.7. reciprocal
If the sqrt16_emC() is used for normalization, or for any other reason, the reciprocal value may be needed:
y := 1 / x;
Because a division is a non-cheap operation on some controllers without specific division hardware support, a linear approximation is possible and evident to use here as well.
5.8. Nom_int16_complex
This is a class for a controlling algorithm to get the normalized values of a vector and/or the magnitude.