RPG Fundamentals: Arrays
RPG has changed so much over the years that, increasingly, we find our students are unaware of many of the basic tools available to them. One is array processing.
By Jon Paris and Susan Gantner | 05/09/2017
This month we're beginning a series on modern RPG fundamentals. So many changes have taken place in the language over the years that, increasingly, we find that students in our classes are unaware of many of the basic tools available to them. So, begging your forgiveness in advance if this is all "old hat" to you, we're going to start off with array processing. In this piece, we're looking at loading and searching arrays. Note that our intent is to focus on RPG's built-in facilities, and not the many more varied and flexible array functions provided by add-on libraries such as Arraylist, which was recently added to the OSSILe Open Source collection. Hopefully by the time you finish this series you'll be eager to explore the capabilities of libraries such as this.
Over the years, as we have added languages other than RPG to our personal toolboxes, one thing has become abundantly clear: as RPGers we do not make as much use of arrays as we should. Often we find people using a work file, for example, when an array would have been a better-performing choice. There are probably many historical reasons for this. For one, until V6 came along, the maximum size of an array was limited to 32K elements. For another, array operations such as SORTA assumed that all elements in the array were in use. This meant that you had to preset the array content to high values to ensure that you didn't end up with a bunch of blanks/zeros at the beginning of the array. Last but not least, performance of array lookups was poor. But we're getting ahead of ourselves; more on this later.
Basic Array Searching
In the code example below, (F) shows the initial definition of the sample array we'll be using. For these tests we used a size of 5,000 elements. (G) shows the invocation of the FillArray() subprocedure. We decided to load the array with 5,000 entries and to use a maximum value of 10,000 for the valueArray values. This will produce data that in our simple tests could theoretically find a match roughly 50% of the time. See the sidebar "Generating test values" for more details on the method used.
At (I) we use %lookup to search the array for the value in searchFor, test the result of the search and increment the appropriate counter.
(F) dcl-s valueArray Packed(7) Dim(5000);

    // Fill array with random values
(G) FillArray( 5000 : 10000 );

(H) For searchFor = 1 to maxValue;
(I)   location = %LookUp( searchFor: valueArray );
      If location = 0;
        countMissing += 1;
      Else;
        countFound += 1;
      EndIf;
    EndFor;
This approach to searching an array is the most common that we have encountered over the years—but it is relatively inefficient. This is because coding it this way requires %Lookup to perform a linear search. In other words, it will compare each entry in the array in turn against the search value. The result is that, on average, for an array of n elements the number of comparisons will be n / 2. In our case, that means that an average of 2,500 comparisons need to be made to determine if the search argument is present in the array or not.
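To make that cost concrete, here is a minimal sketch of what a linear search does, written in Python purely for illustration (the function name and comparison counting are ours, not RPG's internals):

```python
import random

def linear_search(haystack, needle):
    """Compare each element in turn against the search value,
    as %LookUp must do on an unordered array."""
    comparisons = 0
    for index, value in enumerate(haystack):
        comparisons += 1
        if value == needle:
            return index, comparisons
    return -1, comparisons        # a miss always tests every element

# Averaged over many random hits the comparison count approaches n / 2;
# a miss against 5,000 elements always costs the full 5,000 comparisons.
data = list(range(5000))
_, cost_of_miss = linear_search(data, -1)
print(cost_of_miss)               # 5000
```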
If this method is inefficient why is it still so commonly used? There are probably a number of reasons, but the most likely is that this method of processing was the only one available when using the old LOOKUP op-code. And while RPGers have embraced %Lookup as a replacement for the op-code, many are unaware that %Lookup offers a high-speed search option. So how do we utilize that?
There are two basic changes required to our program in order to take advantage of the high-speed search. First we need to add the ASCEND (or DESCEND) keyword to the array definition. You can see that at (J) below.
Second, we must ensure that the array is actually in the sequence that we have told the compiler to expect. To do this, we also need to add a SORTA (K) operation prior to beginning the search loop.
These are the changed/added lines:
(J) dcl-s valueArray Packed(7) Dim(5000) Ascend;

(K) SortA valueArray; // Sort on value
How much faster is this approach? In a word: LOTS. In our tests of 10,000 searches on our 5,000 element array, the fast search outperformed the basic search by a factor of over 100 to 1. That's a big difference. You can see for yourself from the two program displays shown below. You can guess which one is which!
DSPLY  Found 3285 - Missing 6715 - Time = 1.797 sec
DSPLY  Found 3148 - Missing 6852 - Time = .017 sec
In case you are wondering, we performed our timing by using %Timestamp() to capture the time immediately before and after the FOR loop and then using %Diff to calculate the number of seconds between the two.
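The same measurement pattern is easy to reproduce in any language. Here it is sketched in Python for illustration (the function names are ours; the article's programs use %Timestamp() and %Diff):

```python
import time

def time_searches(search, data, probes):
    """Capture a timestamp immediately before and after the search loop,
    then diff the two -- the pattern %Timestamp()/%Diff(*Seconds) gives."""
    start = time.perf_counter()
    found = sum(1 for p in probes if search(data, p))
    elapsed = time.perf_counter() - start
    return found, elapsed

# 10,000 probes against a collection holding only the even numbers
data = set(range(0, 10000, 2))
found, secs = time_searches(lambda d, p: p in d, data, range(10000))
print(found)                      # 5000
```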
The larger the array, the more dramatic the speed difference. For example, doubling the size of the array to 10,000 elements will result in roughly doubling the elapsed time for the traditional linear method. The time for the fast search though will not show any significant change. How can this be?
The answer is actually quite simple: the fast search enabled by the Ascend/Descend keyword is implemented by using a "binary chop" technique. In this approach the first comparison is against the middle element in the array. If that is not a match, the array is "chopped" into two parts, and a test is made to determine whether the required value is (potentially) in the lower or upper part. Whichever part is selected is then subjected to the same process: the midpoint of the selected portion becomes the next comparison position. Hopefully you can see from this why doubling the size of the array has little impact on the time taken. The very first "chop" reduces the scope of the search back to the original array size, so the number of additional operations required is minuscule compared with the impact on a linear search.
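The chop itself is easy to sketch. The Python below is purely illustrative (IBM does not document %LookUp's exact internals), but it shows why the comparison count grows with log2(n) rather than with n:

```python
def binary_chop(sorted_data, needle):
    """Compare against the midpoint, discard the half that cannot
    contain the value, and repeat on the surviving portion."""
    lo, hi = 0, len(sorted_data) - 1
    comparisons = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_data[mid] == needle:
            return mid, comparisons
        if sorted_data[mid] < needle:
            lo = mid + 1          # needle can only be in the upper half
        else:
            hi = mid - 1          # needle can only be in the lower half
    return -1, comparisons

# A miss against 5,000 sorted elements costs at most 13 comparisons;
# doubling the array to 10,000 elements adds just one more chop.
print(binary_chop(list(range(5000)), -1)[1])
```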
There is, of course, a cost to this performance, but luckily it is small. The array must be in sequence for the "chop" technique to work, and therefore it must be sorted before the search begins. Lest you think that requirement would eliminate the speed difference, we should point out that when we included the SORTA in our timing loop, the fast search was still 60 to 70 times faster.
So, given the speed advantages, should you always strive to search only sequenced arrays? The answer, as always, is "it depends." In most cases the answer will be yes, at least for very large arrays and/or very many search iterations. For example, while working with a client who needed to significantly reduce overnight batch processing time, we used the approach of loading all of the account and product codes into arrays. All subsequent validations were made by searching those arrays rather than using the conventional CHAIN/SETLL approach. The result was a massive performance gain; it even significantly out-performed the use of SETOBJACC or SQL.
But there are situations where the choice is harder to evaluate without trying it out. Suppose, for example, that during the processing of a set of transactions you need to build up a list of customers requiring further processing, avoiding duplicate entries. One approach would be to search the array to see whether the customer is already present and, if not, add it to the end. Using the old linear search approach that is all you would have to do: even though the array is not in sequence, the linear search would still find the entry should that customer appear again later. The problem with this approach is that the search slows down with each new entry added. If a large number of customers is being processed, that could cause a problem.
However, if we take the high-speed approach, then we must potentially re-sort the array after each new addition. The only short-circuit logic we could meaningfully apply would be to check if the new customer is higher than the current highest entry and if so skip the sort. But that is about it. Under these circumstances, it is possible that much of the speed advantage would be eaten up by the additional sorting requirement. It all depends on how many customers are being processed and how frequently duplicates appear.
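One middle ground worth sketching (our suggestion, not an RPG built-in) is to use the binary chop itself to find the insertion point and then shift the tail up one slot, so the array stays sorted without a full re-sort. In Python, for illustration:

```python
import bisect

def add_customer(customers, new_id):
    """Locate the sorted position with a binary chop (bisect_left),
    skip duplicates, and otherwise insert in place so the list
    never needs a full re-sort."""
    pos = bisect.bisect_left(customers, new_id)
    if pos < len(customers) and customers[pos] == new_id:
        return False                  # duplicate: nothing to add
    customers.insert(pos, new_id)     # shift the tail up one slot
    return True

customers = []
for cust in (500, 125, 500, 250, 125):
    add_customer(customers, cust)
print(customers)                      # [125, 250, 500]
```

The same idea can be applied in RPG by locating the position and moving the tail of the array up before assigning the new element, though each insert then pays for a block move.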
When building an array dynamically like this, and sometimes even when the array is simply filled and used, we also need to deal with situations where the array is not completely filled. As we mentioned earlier one approach is to pre-load high values into the array so that unused elements are always sorted to the end. But that is old-school RPG thinking—these days we have much better options.
Processing Partially Filled Arrays
Within the context of the example we have been using here, there are two capabilities in RPG that assist in handling partially filled arrays.
First there is the %SubArr BIF. This can be specified in place of an array name, as shown in the code extract below at (L), where it has the effect of restricting the action of the SORTA to only the active elements of the array. The first parameter (valueArray) specifies the array name, the second (1) specifies the first element in the range to be used, and the third (count) specifies the number of elements to be included in the range. This is particularly useful when the array is loaded dynamically during processing.
    // Sort active elements in valueArray
(L) SortA %SubArr( valueArray: 1: count );

(M) location = %LookUp( searchFor: valueArray: 1: count );
The second useful capability is the option to limit the range of elements to be searched by %LookUp. This is done by specifying the optional third and fourth parameters. In the example at (M), the third parameter (1) specifies the index of the first element to be searched and the fourth (count) once again specifies the number of elements to be searched.
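Both range-limiting capabilities have a direct analogue in slice-based languages. This Python sketch is illustrative only (0-based, where RPG's element numbers are 1-based, and the function names are ours):

```python
def sorta_subarr(arr, start, count):
    """Sort only the `count` elements beginning at `start`, leaving
    the rest untouched -- the effect of SortA %SubArr(arr : n : count)."""
    arr[start:start + count] = sorted(arr[start:start + count])

def lookup_range(arr, needle, start, count):
    """Search only the active range, as the optional third and fourth
    parameters of %LookUp restrict it. Returns -1 when absent."""
    for i in range(start, start + count):
        if arr[i] == needle:
            return i
    return -1

values = [930, 17, 455, 0, 0]        # only the first 3 elements are active
sorta_subarr(values, 0, 3)           # trailing zeros are never disturbed
print(values)                        # [17, 455, 930, 0, 0]
```

Note that a search for 0 over the active range correctly reports "not found", even though the unused elements beyond it are zero.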
If you download the code bundle that accompanies this article from our website, you will find the program ARRAYS3, from which the above code lines were taken. We have also added the ability to specify the number of elements to be loaded.
In the next article in this series we will discuss using group fields to define arrays, the usage and limitations of Data Structure arrays when searching, and alternative sort and search strategies.
If there are any specific aspects of array processing you would like us to cover please let us know via the comments section.
Generating Test Values
If we were generating test data to populate a database, we would probably have used a tool like Faker, or the web site generatedata.com, which we described in our article "Two Tools for Testing" back in May 2015. However, for the purposes of array testing all we needed was a series of random values so we used the IBM API CEERAN0 pseudo random number generator to do this. This API returns a floating point value between 0 and 1 which, while it may seem a little odd, makes it ideal for our purposes as you'll see in a moment.
For those unfamiliar with this API here is a short description of how we have used it.
    // Simple procedure to fill requested
    // number of array entries with random
    // values up to the specified maximum
    dcl-proc FillArray;
(A)   dcl-pi *n;
        count Int(10) Const;
        max Int(10) Const;
      end-pi;

      dcl-s i Int(10);
(B)   dcl-s seed Int(10) Inz;
      dcl-s floatRand Float(8);

(C)   dcl-pr CEERAN0 ExtProc;
        seed Int(10);
        randomNo Float(8);
        feedback Char(12) Options(*Omit);
      end-pr;

      For i = 1 to count;
(D)     CEERAN0 ( seed: floatRand: *Omit );
(E)     valueArray(i) = floatRand * max;
      EndFor;
    end-proc;
The procedure interface to our FillArray routine (A) defines two parameters. The first (count) specifies the number of values to be generated and the second (max) the highest value that is to be generated.
At (B) we defined two working fields. The seed value for the generator (seed), and the field into which the generated random number will be placed (floatRand). The seed is initialized to zero - this causes CEERAN0 to use the current time as a seed value. If you wanted to ensure that the exact same set of numbers was generated each time (for testing purposes perhaps) then setting the seed to a constant value of (say) 1 would ensure this.
The CEERAN0 prototype at (C) defines three parameters, the third of which (error feedback) we will not be using, so we have specified it with Options(*Omit). Normally such an API would allow us to simply omit the parameter (i.e., we would use Options(*NoPass)), but this API does not support optional parameters. A word of warning here: we have seen many examples where the author has got away with simply passing two parameters. In most cases it appears to work correctly because the API will only attempt to use the third parameter in the event of an error. But if an error is triggered, watch out: almost anything could happen.

The call to the API takes place at (D). The important thing to note here is that both the seed and floatRand values will be changed on the call. The reason the seed is changed is to ensure that an appropriate seed value is supplied on the next call. Without this, the API might return the same random number over and over again.
Finally, we create and store the randomly generated value (E). This is achieved by multiplying the returned random number (a value somewhere between 0 and 1) by the maximum value we want in our test data. In theory, since we are generating 5,000 values with a maximum value of 10,000, our search for each possible number in the 1 to 10,000 range could return a 50-percent hit rate. However, in practice the actual rate is lower due to the likely generation of a few duplicate values.
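The scaling trick is easy to see outside of RPG. This Python sketch is illustrative only (random.Random stands in for CEERAN0, and the function name is ours); it mimics FillArray and shows why duplicates keep the hit rate below the theoretical 50 percent:

```python
import random

def fill_array(count, max_value, seed=None):
    """Mimic the CEERAN0-based FillArray: scale a 0-to-1 float by the
    desired maximum. A fixed seed gives a repeatable sequence, just as
    a constant CEERAN0 seed value would."""
    rng = random.Random(seed)
    return [int(rng.random() * max_value) for _ in range(count)]

values = fill_array(5000, 10000, seed=1)
# Duplicates are all but inevitable when drawing 5,000 values from a
# range of 10,000, so the count of distinct values falls short of 5,000.
print(len(set(values)) < len(values))    # True
```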
Jon Paris and Susan Gantner are technical editors for IBM Systems magazine, Power Systems.