A Closer Look at RPG's DATA-INTO
An introduction to a more generic version of our CSV parser, along with trace support that IBM has built in to facilitate the development of parsers.
By Jon Paris03/01/2018
In last month's article “XML_INTO Meets Open Access With RPG’s New DATA-INTO,” we discussed the basic features of the new DATA-INTO Op-code, and the basics of writing parsers for it. View the PTFs and Op-code here.
This time we want to touch on a couple of the additional items that we mentioned in that article, and to introduce a more generic version of our CSV parser. We will also describe the trace support that IBM have built in to facilitate the development of parsers.
Adding Functionality to the Parser
We'll begin by discussing the changes we made to the CSV parser. To keep our original example simple, that parser was hard-coded to expect exactly two specific fields of data in the CSV file being parsed. Here, we’ll improve that original parser by allowing it to work with variable numbers of fields and the CSV data can supply the names of the DS subfields expected. Making the code more generic also requires that we look into how to deal with error conditions detected in the CSV data uncovered during parsing and how to determine when specific data was missing.
We won't be showing a lot of the code, since so much of it is unchanged from the previous article, but if you would like to examine it in full detail we will post a link to the source code bundle soon. You'll find that in addition to the main parser (PARSECSV2) we have also included a number of test programs and the IFS files that they utilize.
We decided to take the route of using the file's header row to supply the field names since such CSV files are common. Once we’ve looked at this implementation we'll discuss alternative approaches.
(A) dcl-ds CSVData_T qualified template; count int(5); value varchar(256) Dim(100); end-ds; .... (B) // Structure to store column names dcl-ds columnData LikeDS(CSVData_T); // Structure to store record field data dcl-ds fieldData LikeDS(CSVData_T);
We began by defining the template DS CSVData_T (A). It simply contains a count of the number of active elements, and the values of those elements. The element size (256) and the number of elements (100) were arbitrary choices that should be big enough for most purposes. By using a template, we ensured that should we ever need to change these sizes in the future that this can be easily achieved.
We then create two instances of this template (B). columnData, which will be used to retain the column headings (i.e. field names) extracted from the first record in the file, and fieldData, which will hold the field values extracted from subsequent records.
The process of loading the column headings begins at (C) by obtaining a record. The resulting value is tested (D) to ensure that data was obtained. If nothing was found then the DATA-INTO API QrnDiReportError() is used to report the error. Note that once the error has been reported control will not return to your parser but rather RPG will issue an error message including the error code that you supplied as the second parameter on the call. In our case we give this error status an error code value of 1 which is defined by the constant NO_RECORDS. The number you use is entirely up to you, but we suggest that you establish some shop standards to document which code means what. In this situation, the message in the job log looks like this:
The parser for the DATA-INTO operation detected error code 1.
At (E) we load the column heading data structure by calling our getFields() procedure and passing it the record. This routine handles the field separators as well as any special characters such as double quotes around strings. We'll discuss the mechanics of the routine in a future article.
By the time that control returns to the main line again, the record headings are all in place within the value array of the DS columnData and the count field has been set with the number of active entries. The significance of this count will become apparent later.
(C) // Get column heading record i.e. the first one record = getRecord(parseInfo); (D) If record = ''; // No records found so issue error message QrnDiReportError(parm.handle: NO_RECORDS: 0); // NOTE: There is no return from this call Else; // Load column headings from fields in first record (E) columnData = GetFields(record); EndIf;
Once the column names have been obtained we process each of the records in the file calling the getFields() routine to populate the fieldData data structure. We then proceed to loop through all of the fields that we found (F).
Notice that at (G) we test to see if the current index exceeds the count of the number of column names that we found. This would indicate that there are more fields in the record than we were told to expect. This is considered an error and is reported with the QrnDiReportError() API as we did earlier.
For each field for which we have a column heading (i.e. field name) then at (H) we report that name followed by its value. This continues until all the fields in the record have been processed.
One quick point on column names: By default, DATA-INTO will automatically trim spaces from the data vales passed to it for storage. However, we found out the hard way that it does not do this for names. Trailing spaces are not a problem, but leading spaces must be removed from the names or you will get a field name mismatch.
At this time, you might be wondering what would happen if we have less data that we have column headings. In other words, field(s) have been missed off the end of the record. Catering for this is not actually a function of the parser but rather is controlled by DATA-INTO itself. We'll see how that works when we look at the sample program that utilizes this parser.
(F) For i = 1 to fieldData.count; (G) If i > columnData.count; // More fields than headings? QrnDiReportError( parm.handle : TOO_MANY_FIELDS : parseInfo.lineStartOffset); // NOTE: Control will NOT be returned from this call Else; // Report name of subfield using the column heading (H) QrnDiReportName ( parm.handle : %addr(columnData.value(i): *data) : %len(columnData.value(i)) ); // Report the value to go into that subfield QrnDiReportValue( parm.handle : %addr(fieldData.value(i): *data) : %len(fieldData.value(i)) ); EndIf; EndFor;
Before we move on to look at the sample program, one quick comment on alternative options for handling the column names where there are none in the file. The first and most obvious would be to simply generate names for each column found in the record, e.g.,: column1, column2, column3 and so on. While this is an easy solution, the major disadvantage is that it results in some pretty meaningless field names in the user program. It’s far better to have meaningful names, so a better approach would be to supply the names of the fields via the additional user parameter available on DATA-INTO's%PARSER parameter. We will discuss how that works in a future article, but if you can't wait that long, IBM describes its usage in some of the example programs that ship with DATA-INTO.
Using the Parser
Here is a sample of the data in the file Sample2.csv that will be processed by this program.
Account,Name,City,State 3456,IBM,Rochester,MN 4567,"American Falls","Niagara Falls",NY ....
At (J) you can see the DS array that we have defined to hold the data retrieved by DATA-INTO. Note that the field names must match those in the column headings of the first record in the file. In this instance, the fields are in the same sequence as the columns, but this isn’t compulsory. DATA-INTO, like its big brother XML-INTO, is only concerned with the name of the field within a given hierarchy, not the sequence of the elements at any given level.
At (K) you see the DATA-INTO operation which identifies the DS accounts as its target. The %DATA BIF identifies the source of the data to be processed and the options to be applied. Just as with XML-INTO, the doc=file option tells it that the first parameter represents the name of a file, rather than the actual data to be processed. Similarly, case=any allows the file's column headings to be in mixed case and still match the RPG field names. As in our previous example, %PARSER identifies the program to be used to parse the document.
Once the DATA-INTO operation has completed, the element count in the PSDS is used to iterate though the values and display the results (L).
J) dcl-ds accounts Qualified Dim(200) Inz; account char(4); name char(30); city char(30); state char(2); end-ds; dcl-s pad char(1) Inz; // numElements contains a count of the number of active // elements placed in the accounts DS by DATA-INTO operation dcl-ds pgmStat psds; numElements int(20) pos(372); end-ds; dcl-s i int(10); (K) data-into accounts %data('Sample2.csv' : 'doc=file case=any ccsid=job ') %parser('*LIBL/PARSECSV2'); (L) for i = 1 to numElements; dsply ( 'Account: ' + accounts(i).account + ' City: ' + accounts(i).city); endfor;
DATA-INTO shares a lot of functionality with its older brother XML-INTO, including the ability to handle situations where an expected value is missing. As with XML-INTO, one possible way to deal with this is by specifying the option allowmissing=yes. But, as we have pointed out when wiring about XML-INTO, this is a very blunt instrument since it ignores all missing fields, not just the ones that you might consider optional.
For example, in our test scenario, perhaps we have records for non-U.S. customers. In those cases, no state value would be expected, but we still want the rest of the data to be present. The better option is to use countprefix as shown in this modified version of the accounts DS and DATA-INTO operation.
dcl-ds accounts Qualified Dim(200) Inz; account char(4); name char(30); city char(30); state char(2); (L) count_state int(5); end-ds; .... (M) data-into accounts %data('Sample3.csv' : 'doc=file case=any ccsid=job - countPrefix=count_' ) %parser('*LIBL/PARSECSV2'); (N) for i = 1 to numElements; if accounts(i).count_state = 0; // No state info dsply ( 'No state info for Account ' + accounts(i).account ); ....
You can see at (L) that we added a field named count_state to the DS and at (M) specified that a count prefix of count_ was to be used. The result is that DATA-INTO will keep track of whether a value has been placed in the state field or not by placing a 1 in count_state when a value is loaded into state, otherwise count_state would be zero. This condition is tested for at (N) and a message issued to note the omission.
Because of this logic a line in the CSV file such as this:
Will no longer cause an error due to a missing field. Just as with XML-INTO, the simple way to think of it is to say that DATA-INTO will only trigger a "missing" error if you have not given it a way to notify you of the situation. So while the line with the state column omitted is OK, if we modified the line to look like this:
where the city information was also omitted, this would still cause an error as no count_city field exists in the DS.
One other point about the %DATA options. Since we are using column headings to supply field names, you might have wondered what would happen if a column name such as “Account Number” was used. Spaces aren’t valid in RPG field names, so the default behavior for DATA-INTO would be to treat the name as a mismatch. However, the case=convert option can be used to handle this. Any spaces encountered would be turned to underscores if this option is used. So “Account Number” would match to the field name Account_Number.
Developing a parser can be tricky until you get the hang of it and, unlike Open Access, each parser will be different and therefore will not lend itself to the program template approach that we use with Open Access.
In particular, should the parser "blow up" (not that our code ever did that of course!) it can be really difficult to understand just what the DATA-INTO runtime has "seen". For that reason, IBM supplied a very nice debugging option that when enabled will give you lots of information about what has happened.
To enable tracing, you must set the environment variable QIBM_RPG_DATA_INTO_TRACE to a value of *STDOUT. You can do this with the WRKENVVAR command. Once you have done this, you can run the simple program DSPSTDOUT that we have included with the code package to see the trace data on the terminal. This program was based on early IBM documentation and we’re not sure whether it will ship as part of the compiler or just appear in the documentation.
More often though you won't need to do this deliberately as it will be shown automatically whenever DATA-INTO encounters an error and blows up. In the future, IBM intend to offer other options, such as to allow the trace data to be written to a file. But we’ve found the current option sufficient to help us in our debugging.
Just to show you the kind of information supplied, here's an extract from a trace generated when processing a file that had too many fields in the record.
.... ReportValue: 'Canadian Customer' ReportName: 'City' ReportValue: 'Mississauga' EndStruct StartStruct ReportName: 'Account' ReportValue: '6789' ReportName: 'Name' ReportValue: 'Too much data' ReportName: 'City' ReportValue: 'Anywhere' ReportName: 'State' ReportValue: 'AL' ReportError, RC = 2 - Bytes parsed: 231 Terminating due to error
Notice that the information is indented to show the hierarchical level (i.e. the nesting of structures) as viewed by DATA-INTO. You can also see the results of each and every call that the parser made to start and end structures, report names and values, etc. You can also see that our error code 2 (too many fields) shows up, along with the position in the file at which the error was detected.
That’s useful stuff. But wait, there's more! You can also add your own debugging information to the trace by calling the QrnDiTrace() procedure and supplying a message and nesting level information. We really like this feature.
The DATA-INTO capability will be released very soon for 7.2 and 7.3. Once that’s happened, we'll publish a review of the samples provided with the package, walking you through some of the features that they highlight.
We hope you have as much fun playing with DATA-INTO as we have. Perhaps more importantly, we look forward to seeing what third parties and open-source groups can do with it. If you have any questions or thoughts on this new support please let us know.
Jon Paris is an IBM Systems magazine, Power Systems edition technical editor. More →
Post a Comment
Note: Comments are moderated and will not appear until approvedcomments powered by Disqus