Therefore the choice of which file system features to include and exclude has primarily meant concentrating on things that add functionality, while leaving out things that are only there on Model 204 to achieve an extra bit of speed. For example, the absence of the "Key", "Numeric Range" and "FRV" field attributes was deemed acceptable, since the b-trees used by the "ordered" attribute provide the same functionality. On the other hand, invisible indexes were kept, because they are required to support the astronomically useful FILE RECORDS statement.
Almost all User Language database-related functionality is currently supported.
In addition to being less functionally overloaded than the Model 204 file system, DPT has to take account of some major differences resulting from the fact that we are not on the mainframe. To take the initial implementation on Windows as a case in point:
The key ingredients are:
In the case of field-level b-tree structural control settings, the only one available is SPLITPCT. In other words NRES and LRES are not used (SPLITPCT has the same function but might therefore need to be changed after a load on DPT). The IMMED parameter is not used either. DPT applies various IMMED-style measures within the index structures, but these are under automatic control.
Overall structure
A DPT file consists of two main parts, namely:
Magic numbers
B-trees
DPT's b-tree implementation undoubtedly differs from Model 204's in many details. However there is one particularly large point of contrast, which is that each field has its own b-tree with root page etc., instead of all fields sharing a single tree with entries prefixed with the field code, as on M204. In general usage this difference should not really be noticeable.
The root page for each field, as well as being the root of the tree, maintains various information which DPT can use to improve its handling of the tree in general use.
Inverted lists
DPT files do not keep any inverted list information in the b-tree leaf pages; inverted lists are stored as entirely separate entities. This means the growth characteristics of b-tree data structures do not get mixed up with those of inverted list data structures, and makes things much easier to handle internally.
In practice the difference means that DPT b-trees are likely to be more compact than their M204 equivalents, and many kinds of value and search processing will require a few fewer disk reads. Of course there are no free lunches, and other kinds of processing, specifically when lots of inverted lists are accessed, will require a few more disk reads.
The ANALYZE command can show in more detail what's going on in any particular situation, and also has some extra interesting DPT custom options.
BLOBs
The BLOBs chapter later on includes some notes on BLOB data structures on DPT.
The minimalist forms of the command are:
ALLOCATE MYFILE MYFILE.DPT //database file
ALLOCATE OUTDATA OUTDATA.SEQ //sequential file
ALLOCATE STDPROC STDPROC //procedure directory
ALLOCATE MYFILE MYFILE.OLD DIRECT //error - must have DPT extension
ALLOCATE OUTDATA OUTDATA.OLD SEQUENTIAL //OK - sequential file need not be .SEQ
ALLOCATE OUTDATA OUTDATA.DPT SEQUENTIAL //error - sequential file must not be .DPT
ALLOCATE STDPROC OUTDATA.DPT PROCDIR //error - must be a directory not a file
As on the mainframe, there need be no correspondence between the "DD name" (MYFILE here) and the "dataset name" (MYFILE.DPT). However, unlike Model 204, the physical file contains no information relating to the DD name, meaning there is no difficulty in "renaming" a file, something DBAs often have to achieve on Model 204 by using the RESTORE 192 option. In this example MYFILE.DPT could be used as MYTEST simply by re-allocating it:
FREE MYFILE
ALLOCATE MYTEST MYFILE.DPT
The "DSN" (MYFILE.DPT in the above example) is held in the FCT, and viewable as parameter OSNAME. If you do rename the OS file, DPT issues an informational message the next time the file is opened. Note that OSNAME does not hold the full path of the file, only the actual file name.
ALLOCATE DB DB.DPT //dpt\DB.DPT
ALLOCATE MYFILE 'MY DATA\MYFILE.DPT' //dpt\My Data\MYFILE.DPT
ALLOCATE MISCHIEF C:\SYSTEM.INI SEQUENTIAL //absolute file name
ALLOCATE WORKDIR . //dpt base directory (two dots would be the parent of that)
The commands shown here are all in upper case, as per the most common M204 CASE parameter setting. However, file names on Windows and some other operating systems are not case sensitive, meaning that the actual file names may be a mixture of cases and these commands will still work. The values for the "DD" and "DSN" are uppercased for internal DPT use anyway (e.g. if you VIEW OSNAME).
Like Model 204, DPT writes control information to the FCT page of a database file, even if no explicit updating is going to happen to the actual data in the file. This means that database files on read-only media such as CD-ROM cannot be accessed. Depending on how the drive is mapped, a CD-RW file might or might not work. Sequential files can be read from read-only media so long as they are declared as such at allocate time (see below).
ALLOCATE DEMO DEMO.DPT NEW //NB. the default is OLD
ALLOCATE DEMO1 DEMO1.DPT COND //OLD if it exists, NEW if it doesn't
ALLOCATE DEMO2 DEMO3.SEQ MOD //start writing at the end
ALLOCATE INDATA T1.DAT SEQ READONLY //might be on CD-ROM or tape
ALLOCATE SCRATCH TEMP //system generated dsn in #SEQTEMP dir
In the case of NEW or COND, we also need to think about space parameters. When an "empty" file is created on Windows, it has zero space allocation and notionally occupies no disk space. What's more, as mentioned earlier, it is up to applications to control file sizes themselves - the OS will never tell us that a file is "full" like MVS does with a B37 abend; it only complains when the disk itself is full. Therefore, the notion of specifying space parameters on an ALLOCATE command for a new file could not have the same meaning.
If the file is to become a database file, space parameters will in any case be specified with the CREATE command (BSIZE etc.), as described later, so there is no need to do so now. If the file is to be used as a sequential output file, setting a maximum size is worthwhile, to prevent runaway processing. All the mainframe parameters pertaining to file size are distilled into a single custom parameter, namely MAXSIZE, which can be given whatever the disposition (i.e. OLD too). The value is in units of 1K (1024 bytes), and can range from 0 (no check) to 2G (so max file size 2TB). When executing an image write or print to a USE file, if the file is larger than this an IO error is reported.
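For example, a minimal sketch (the keyword placement is assumed here, following the style of the other ALLOCATE options rather than taken from the command reference):
ALLOCATE OUTDATA OUTDATA.SEQ NEW MAXSIZE 102400 //assumed syntax - cap output at 102400K, i.e. 100MB
Any image write or USE output that would take the file past that size would then report the IO error described above.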
TEMP sequential files are placed, logically enough, in the #SEQTEMP directory. DPT uses a system-generated DSN, and any other DSN specified is ignored. These files are useful to avoid cleanup work, because DPT automatically deletes the underlying file after the FREE command is issued, or at system closedown.
Next the tricky area of record format and length. The most common convention on the PC is for "sequential" data files to contain records of arbitrary length, with line separator characters or character sequences denoting the ends of records. In fact this is so prevalent, despite the restriction it puts on what can be in the actual data, that both USE and image IO work like this on DPT by default. The end-of-record separator on Windows is the 2-character sequence of X'0D' (carriage return) + X'0A' (line feed), known as "CRLF". On Unix or Mac if DPT ever goes there it might be just LF, or just CR, respectively. A less universal convention is whether such files of "records" have a final terminator after the last record. READ IMAGE will handle it either way. WRITE IMAGE and USE output will always write the final EOR.
Despite the use of the above convention, there is still a role for a record length option on ALLOCATE. If no LRECL is specified, the file behaves as if variable length records were being processed. In other words USE and WRITE IMAGE will not truncate and READLEN after READ IMAGE will vary. The following options are only allowed on sequential files.
ALLOCATE TESTC TESTC.SEQ LRECL 2000 //CRLF present but simulate fixed length records
In this case the file will now behave as if fixed length records were being processed, although at the actual disk level they are variable in length and terminated with CRLF as per the default. So USE and WRITE IMAGE will produce records padded or truncated to 2000 actual data characters (i.e. 2002 bytes on disk including the CRLF). During READ IMAGE the image is populated as if the input record were 2000 bytes long, even if much shorter on disk. And READLEN will be 2000.
You also have the option to control the pad character used during WRITE or USE. It is given as a number which is the ASCII code (in decimal format) for the desired character. The default is 32 which is a space.
ALLOCATE TESTC TESTC.SEQ LRECL 2000 PAD 0 //pad with ascii 0 characters
Finally there is an option to write true fixed length records, with no CRLFs in between. This option is essential when using READ IMAGE with floating point data, since such items may contain CR and LF characters within the FP bit pattern. It does however require more coordination between the processes that write and read the file, since the disk record is not free to vary in length and still be processed successfully by READ IMAGE.
ALLOCATE TESTC TESTC.SEQ LRECL 2000 PAD 0 NOCRLF //as above but change EOR convention
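To illustrate why NOCRLF matters here, a minimal User Language sketch (the image, item and DD names are invented; the statement forms are the standard Model 204 ones rather than anything checked against DPT):
IMAGE SCOREREC
  CUST_NAME IS STRING LEN 30
  SCORE IS FLOAT LEN 8
END IMAGE
READ IMAGE SCOREREC FROM TESTC
The 8-byte floating point bit pattern of SCORE can legitimately contain X'0D' or X'0A' bytes, so only the NOCRLF/LRECL arrangement above lets READ IMAGE find record boundaries reliably.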
For example:
ALLOCATE MYHTML #WEB/HTML PRSUFFIX=.HTML
ALLOCATE MISCWEB "Miscellaneous web stuff" PRSUFFIX=C''
After these commands any files in MYHTML with the extension .HTML (case-insensitive) can now be accessed via procedure commands and the client GUI without giving the extension. Files in MISCWEB can also be accessed as "procedures" but we just have to give the file extensions too as part of the proc names. This would be useful if the directory contained various types of file, or even if we just wanted to dispense here with the "hidden extension" scheme normally used to emulate Model 204 procedure names.
Note that if you create new procedures, either at the chevron with the PROCEDURE command or via the various GUI routes, DPT will give them uppercase names unless you issue *LOWER beforehand. All procedure handling is case-insensitive, but you may prefer lowercase file names for aesthetic reasons.
Procedures and data in the "same file"
Some Model 204 applications operate with one or more files that contain both data and procedures. This is handled on DPT with a special ALLOCATE parameter to get round the fact that the data file and the proc directory must be allocated on separate DDs. For example:
ALLOCATE APPFILE C:\DPT\APPFILE.DPT
ALLOCATE APPFILEP C:\DPT\APPFILEP ALTNAME=APPFILE
The ALTNAME means that you can open and use the proc directory using either name. A context called APPFILE can be opened against both the DDs above, and both procedure and data related processing can be performed in it. Note that once you open a directory with either its normal DD name or its ALTNAME, that's the name you have to stick with for all "IN xxxxx" type processing. (Unless you open it both ways, which is allowed).
The FREE command can also be used with an ALTNAME, so for example in the above case "FREE APPFILE" would actually try to free two DDs, in a similar way to OPEN.
The Navigator pane on the client shows the ALTNAME whenever one is defined, on the assumption that if you have defined an ALTNAME that's the name you wish to use.
Database file FCT pages are kept permanently in buffer while any user has the file open, so MAXBUF must be at least as large as the number of files that will be concurrently open in the run.
From the point of view of the Model 204 FORMAT/NOFORMAT option, the command always behaves as if NOFORMAT were specified (i.e. the quick version), although with large files Windows can still take a few seconds to allocate disk sectors etc. Fresh pages are formatted by DPT as the file grows.
The space occupied in bytes after CREATE is:
(1 + BSIZE + DSIZE) * 8192
where the 1 is the FCT page, and BSIZE/DSIZE are used for M204-familiarity reasons to size the main areas of the file.
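For example, with purely illustrative sizes:
CREATE FILE DEMO
BSIZE = 1000
DSIZE = 500
END
the file occupies (1 + 1000 + 500) * 8192 = 12,296,192 bytes immediately after the CREATE, i.e. about 11.7MB.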
The INITIALIZE command works much as it does on M204.
Managing file expansion in general use
If a file fills up one of its tables, UL or API programs will fail in a controlled way, as you would expect. At that point a Model 204 DBA would consider a number of options, the main relevant one here being INCREASE TABLEx. With DPT, if there is space left on the disk, the INCREASE command expands a file into it - the space does not come from FREESIZE. There is no FREESIZE parameter.
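For illustration only - the operand form here is assumed to follow the Model 204 command, i.e. a table name followed by a number of pages:
INCREASE TABLEB 1000 //assumed form - extend table B by 1000 pages taken from free disk space, not FREESIZE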
Repeated expansion of a file on any platform often causes fragmentation. On a system like M204, which manages its own data structures within a single OS file entity, repeated expansion of the internal managed structures causes internal fragmentation as well (e.g. repeated alternate INCREASEs of tables B and D), meaning that there are potentially two levels of fragmentation, with the consequent potential performance degradation. There is currently no facility on DPT to allow the obvious simple solution of over-allocating and then shrinking to fit (DECREASE is not supported). This issue may be addressed in future releases.
Finally a note on field definitions. These are stored in the file in a sequence of pages that will happily extend itself as more fields are defined, so long as there is room in the file. Therefore, as with M204, it is possible that DEFINE FIELD will cause a file-full condition, but it will be when DSIZE is reached, not ATRPG. The current usage level of field attribute pages can be seen by looking at the ATRPG parameter (i.e. the existing M204 parameter has a slightly different meaning) and ATRFLD (new parameter).
On DPT, reorganizations can be performed in the following three main ways.
Since the first two are fully automated, the following are just examples of the third.
=UNLOAD
CREATE FILE SALES
BRECPPG = 200
FILEORG = x'24'
END
INITIALIZE
=LOAD
=UNLOAD EXCLUDING SALES_2001, SALES_2002
INITIALIZE
=LOAD
Insert extra commands after the INITIALIZE to DEFINE the fields with their new storage attributes. All the other fields will be defined by the load as they were before, from the TAPEF information included in the default =UNLOAD. During reload the data and/or index values for the affected fields are read in, interpreted as per the old format, then converted and stored in the new format. (In some cases there may be issues with unconvertible numeric values during this process).
=UNLOAD
INITIALIZE
DEFINE FIELD PRODUCT_ID STRING ORD CHAR
DEFINE FIELD COUNTRY_CODE FLOAT
=LOAD
But some cannot:
Certainly these could be handy in occasional cases, but they are specialized features of the REDEFINE FIELD command which would significantly complicate the internal reorg processing, and it's not really worth reinventing the wheel for rare cases. If you need to do these things, in the first case issue the REDEFINE command(s) before the reorg so that the reorg then repacks table B for the larger records. In the second case it probably doesn't make much difference whether it happens before or after.
When redefining a field from NON-ORDERED or ORD CHAR to ORD NUM, the same comments apply as above. In other words you can use the parameter to force it so that non-numeric values in the STRING field on the table B record, or in the ORD CHAR index respectively, are allowed and all amalgamated under a single ORD NUM index entry for zero.
The main requirement is that the standard conversion of the numeric component must result in the string component exactly as it is stored. To an end user this means that a User Language program could be written which would find the index entry given the value off the table B record, or find the record given the value from, say, an FRV loop. Internally to the DBMS this conversion is crucial, since when deleting or changing fields on records the index entries must be located using the old values in table B.
One case where this affects update operations is when a value is supplied in string format and conversion for the numeric component of the field fails. In such cases (assuming FMODLDPT is set appropriately - see above) both components are stored as zero. A less obvious case is when a non-standard but valid numeric string such as "1E3" or "1000.000" is given. With both of those values the numeric field component is stored as floating point 1,000, and the string component is stored as the standardized version which would be printed by a UL PRINT statement, namely "1000".
With a STRING ORD NUM field in particular, DPT will allow the table B data to be stored as the non-standard but valid number (e.g. 1E3) since it is then still possible for the DBMS during DELETE or CHANGE operations to locate the index entry. This concession is controlled by the FMODLDPT X'02' bit which is by default active (allow). Turning off this bit causes the table B data to be converted and stored in standard form ('1000'). You should remember that if you do a User Language DELETE or CHANGE by value, the form stored in table B is used to locate the old value, so for example CHANGE SCORE = '1000' TO '2000' would not work if it had been stored as '1E3'.
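As a sketch of that last point (file and field names invented, FMODLDPT X'02' left at its default), suppose the field SCORE was originally stored from the value '1E3':
BEGIN
  RECS: FIND ALL RECORDS FOR WHICH SCORE = 1000
  FOR EACH RECORD IN RECS
    CHANGE SCORE = '1000' TO '2000'
    CHANGE SCORE = '1E3' TO '2000'
  END FOR
END
The find can use the standardized ORD NUM index entry for 1,000, but the first CHANGE has no effect because table B holds '1E3'; only the second CHANGE matches the stored form.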
FLOAT ORD CHAR fields are unaffected by the FMODLDPT X'02' bit. The ORD CHAR b-tree entry will always be numerically standardized as previously described.
Finally note that if you use fast load and supply your own index information (TAPEI) the above format synchronization is not performed since the data and index updates happen at different times. If the source is another DPT file or a Model 204 file this should not be a problem though.
On DPT, like M204, all database update work happens a great deal faster with TBO and checkpointing turned off. In fact deferred update mode (see later) actually requires TBO to be turned off.
Unlike general-purpose DBMS processing, dedicated large scale data-load jobs do not benefit from a large buffer pool (MAXBUF), since each page is only written once and not returned to.
DPT fast load accepts several different data formats (see appendix), of which the simplest is "PAI" style, as traditionally used on Model 204 and generated by the sample program "DPT_EXTRACT" in DEMOPROC. If you're prepared to make more effort, the lower-level formats reduce file transfer times and increase load speed at the DPT end (see benchmarks). It is also possible to extract and load existing index information rather than have DPT build it afresh during the data load.
So in summary there is a lot of scope for playing around with it, but the simplest data transfer process would be something like:
Other general notes, whichever way you do it:
The so-called "fast" processing is fast for several reasons. Firstly, even in simple cases it easily beats the equivalent User Language + image program, by bypassing both the UL runtime and DPT's sequential file emulation layer. Secondly, rather than necessarily generating index entries at load time from the incoming data, there is an option to supply some or all of them them pre-built. Thirdly, DPT version 3.0 includes a lot of new optimized algorithms in critical places. The main downside of fast load/unload is that the layout of the input/output files is not infinitely flexible like it is if you write your own load program.
Fast load/unload processing is invoked using DPT commands, or their equivalent UL $functions, all of which have more detailed notes in the DPT language guide - see =UNLOAD, =LOAD, $UNLOAD, $LOAD. The GUI interface in the File Wizard utility is also convenient for ad-hoc jobs.
Users and DBAs might typically make use of these features to perform 'custom' reorg as covered in the reorgs section of this document, or to load data previously extracted from a Model 204 system on the mainframe. In addition to these situations, DPT uses the fast unload/load functionality under the covers during processing of the REORGANIZE command and some REDEFINE FIELD commands.
In addition, the files must always conform to the following naming convention (you can alter the directory with command options).
Unload
A complete unload creates the following output files:
Load
When issuing the =LOAD command, the type of information loaded depends entirely on which, if any, files matching the above name pattern are present in the input directory. In addition, depending on options, the load is capable of accepting index information for all fields in a single file, namely:
Fast unload and fast load increment most stats but there are some they don't because they bypass the regular processing. For example BADD is not incremented during a load, since table B records are usually written in a single block, and depending on the active options fast load may not even look at the fields within the record.
Fast load will take account of the reuse queue in RRN files, but if the reuse queue pages are highly fragmented that will severely hamper its ability to load records in large extents. RRN files should be reorganized regularly anyway.
DPT's deferred update facilities are a little more basic than Model 204's. Specifically, the feature was added to DPT for use in straightforward load jobs rather than more general processing, and as such only allows record STORE and field ADD operations to be deferred. Other types of update will fail when deferred update mode is active for a file. Deferred update mode can also only be activated when the system as a whole has TBO turned off.
There are two flavours of deferred update processing covered in this section, as follows.
Note that like on Model 204, only the first OPEN statement for a file in any given run should have this special form. After that, use plain OPEN or OPENC as normal, and the file remains in deferred update mode. Further adorned OPEN commands fail. If the file is freed, or the system restarted, a new adorned OPEN command must be issued before updates can be performed again (since the file remembers it is in deferred update mode even when the system is down - FISTAT X'20' is stored on the FCT page).
Once the deferred update sequential files are attached, field add operations from both STORE statements and explicit ADD statements get applied to the data part of the file directly, but not to the indexes. Index updates are written to one of these two files.
Appendix 1 contains a description of the various record layouts in which DPT might generate sequential deferred update data. If you don't simply crib the sample job, that information will be required to configure the sort program parameters.
Home users perhaps don't have a file sorting utility to hand. Well, there are plenty available on the web, and some are even free. Development of this area of DPT was done using a freeware program called CMSORT (www.chmaas.handshake.de), which is a very basic, easy-to-use tool, and works very well even on large files of tens of millions of deferred update records. It is not really ideal though, mainly because it cannot sort raw floating point values. The demo batch job which comes with the DPT download and uses CMSORT therefore uses numeric format N2 (stringized numbers).
When loading deferred updates that are well sorted, not only does the load happen very quickly, but you can pack b-tree nodes much more tightly without fear of later values coming along and causing splits. Set SPLITPCT for the indexed fields to the percentage fullness that you would like the nodes to finish on. The choice of this value would depend on how much further updating you were expecting after the load - for completely read-only files you may as well pack nodes to the maximum degree (SPLITPCT=99, assuming ascending sort order). If randomly-ordered new values are expected after the load, set SPLITPCT to something nearer its default value of 50 to minimize the number of splits caused by those later updates.
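For example, using the abbreviated attribute spelling shown later in the TAPEF notes (field names invented):
DEFINE FIELD ACCOUNT_ID (STRING ORD CHAR SPLT 99) //read-mostly after the load - pack leaf nodes full
DEFINE FIELD STATUS (STRING ORD CHAR SPLT 50) //random updates expected later - leave room for splits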
After Z has finished it is assumed by the system that the file is in a physically consistent state, and FISTAT X'20' is turned off. It trusts that you (or the sort) haven't lost or corrupted any of the index updates. If that happened, either the Z command would fail, or the file would crash later during general use. To load a set of index updates in multiple file chunks for whatever reason requires that the file is re-opened into deferred update mode again after the Z for each chunk.
As with the first step of the process, a large buffer pool is of no real benefit here, since each b-tree node and inverted list page is only written once and never retrieved again.
Available from DPT V2.14, this mode is always preferable to multi-step, being much simpler to set up, and almost certainly much faster (e.g. see version 2.14 release notes benchmarks.) For typical data loads, the fast load feature is much faster still, but since they share a lot of code underneath, the following notes are not obsolete.
The index data is written out when the last user closes the file, or if there is more data than will fit in real memory, periodic partial flushes are performed as and when memory fills up. The user has some control over exactly when and how these flushes happen.
Compared to the multi-step process, this scheme
Each time DPT performs a partial or final flush of the index data, a block of statistics and other information is written to the audit trail, so you can see how things are progressing, and tweak various file and field parameters etc. for next time if required.
On a machine with other major applications running, specifying too high a figure here could cause memory wars and result in a lot of Windows page file access, thus defeating the whole point. The best approach is of course to run loads on an otherwise-quiet system, and give DPT as much memory as possible. The optimum value will depend on each machine (amount of RAM installed) and Windows version (Vista for example has a large working set, but reportedly suffers less from heap fragmentation). Check Task Manager during the load - an increasing value in the 'page faults' column would indicate Windows swap file access, which is not good. Note that many versions of Windows Task Manager under-report process memory usage, or at least the Task Manager interpretation of the data is not what you might expect.
If LOADMEMP is specified too low, a load may start but give up after one or more chunks (leaving the file physically inconsistent). This is simply a consequence of the fact that below a certain point it is not really possible to equate Windows heap (virtual memory) usage with a "% physical RAM usage". Generally it doesn't make sense to run with much less than the maximum anyway, in which case this shouldn't be a problem.
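For example, assuming LOADMEMP is reset like any other parameter (the figure is purely illustrative):
RESET LOADMEMP 75 //assumed reset - let the deferred update memory area use up to ~75% of physical RAM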
Like a multi-step load, this process does not benefit from a large "regular" (MAXBUF) buffer pool. In fact, since every 8K buffer page allocated is 8K less for sorting deferred updates, the single-step process actually insists on a small buffer pool.
The single-step process makes heavy use of Windows heaps, which can become fragmented over time. During a long load you may see the size of later chunks become a little smaller than earlier ones. This is normal.
Depending on the data characteristics, building an index b-tree once and then enlarging it again and again (option 2) can be extremely expensive, and this is why option 1 is the default. However, choosing option 2 can give somewhat improved speed in some cases, by avoiding sequential I/O and the merge. Specifically, it is worth trying for:
SPLITPCT:
B-tree entries are loaded in order by this process, so SPLITPCT=99 is right in most cases, assuming the file is empty to start with. If you have some NO MERGE fields that will be getting some leaf splits, set SPLITPCT to some value less than 99 as appropriate.
Updating multiple files
If several files are updated in single-step mode at the same time, updates are offloaded to temporary sequential files whenever any file update detects that LOADMEMP% has been reached. Note however that the file only then offloads its own deferred updates. Normally this is no problem, since if all files are getting plenty of updates each will offload occasionally and memory will get used productively, even if the average offloaded chunk size is a lot smaller than it would be in a single-file update situation. However it is possible that one or more files might sit for some time hogging a lot of memory but neither getting closed nor receiving many more updates, thus causing other files to repeatedly offload sooner than necessary. In cases like this, make sure all users close files when their updates are finished. Alternatively, if updates to one file come in bursts among updates to other files, followed by long delays, the =Z1 command allows you to request offloads explicitly; you can issue this periodically as required, for example via $COMMBG.
OS File handles during merge
Since host machines and OS platforms vary, performing larger loads on smaller platforms may require so many chunks that the merge phase cannot open all the temporary files concurrently. In such cases the merge happens in more than one pass. Generally this is transparent, but the user does have some control if required using the LOADMMFH parameter.
Diagnostics
At diagnostic level 4 (LOADDIAG parameter) some of the stats given are estimates, since it is not known at the end of a chunk whether an inverted list will eventually become e.g. a table D list or bitmap.
As with subsystems there is no option to store groups permanently - they must be defined in each run, for example by user 0 as illustrated in the demo installation. However, for compatibility reasons, and also to maintain the distinction between system-wide groups and user-specific groups, the TEMP and PERM options on CREATE GROUP are still used - PERM meaning simply a system-wide group, and TEMP a user-specific group. Apart from this there is no difference in processing between the two types of group (for example, on Model 204 when closing a temp group the individual files are left open, whereas on DPT they are closed, if appropriate, with either type of group). Ad-hoc groups are allowed in UL, and are effectively closed, and deleted, at the end of the request.
On DPT, the files making up a group do not have to have the same set of fields, but if they share any fields, the field definitions must be the same (in every respect except SPLITPCT and UP/UE, which control internal processing and do not affect anything which directly impacts on the user). This is slightly more strict than Model 204, where the system attempts to make the best of things. In most cases however, if field definitions differ, it is either a failure of DBA procedures or the fields are not actually the same thing, so to flag it up is in everyone's interest. Also it greatly simplifies the processing of operations like searches in group context.
To summarise: Field references only fail compilation if none of the group members contains the field. At run time in a FOR EACH RECORD loop, if the current record is in a file that does not possess a field, printing or otherwise reading the value will behave as if the field was simply missing from the current record, and attempting to update it will cause an error. In FIND processing, only group members with the field defined are considered as potentially having any records for the final set. In a STORE of course the UPDTFILE must have all the specified fields.
OPEN MYDATA //sets CURFILE
OPEN MYPROCS //sets CURDIR
If this sounds confusing, in practice it is normally transparent to the user. Where there is a data file and a procedure directory with the same name, OPEN, CLOSE and DEFAULT are special commands, operating on both "sides" (i.e. they may change CURDIR as well as CURFILE). To clarify what is going on, DPT issues distinct messages for procedure directories and data files, so these commands/statements may each sometimes generate two messages.
OPEN MYDATA
BB.4001 Database file opened
OPEN DEMO
BB.3001 Procedure directory opened
BB.4001 Database file opened
There are other small differences in the processing of the DISPLAY FILE command and the $ITSOPEN function. The Navigator pane in the client GUI also shows procedure and data files separately, and provides different command options for each.
PROCEDURE, DELETE PROC and most flavours of DISPLAY PROC work as usual.
Creating a "procedure file" is a trivial matter of creating a new OS file directory, which can be done with external utilities or using ALLOCATE ... NEW. Traditional file parameters such as PDSIZE are irrelevant.
The file name extension of the underlying text files representing procedures is by default expected to be ".txt", although this is a resettable user parameter, PRSUFFIX, and can also be overridden on a per-file basis. So when accessing procedures you never have to specify the extension (if you don't want to), and procedure names can look pretty much as they would on the mainframe.
As for the rest of the procedure name, a possibly inconvenient consequence of using actual OS files for procedures instead of implementing "table D" arises because of prohibited characters in file names. On Windows, file names may not contain any of the following:
\ / : * ? " < > |
Model 204 on the other hand disallows these in procedure names:
= , ; ' space
Not so much a slight discrepancy as a complete failure to overlap in any way! The convention adopted on DPT is that none of the characters in either of these two sets is allowed in procedure names. DPT will not (currently) allow you to specify procedure names with quotes around them in any situation.
Since the CASE parameter will normally be set to *UPPER, most of the time the procedures you create will be named in uppercase. However, as mentioned earlier, case will often not be relevant when referring to them by name.
The way it works when you OPEN a group is similar to the non-group case, in that the system attempts to open two group contexts with that name, one for data operations and one for procedures. If the group members are all valid directories and valid database files, both group contexts will open successfully, and both types of operations will become available. If only one or the other flavour opens successfully, that's still OK, with an error message being issued only if neither can be opened.
In procedure file group contexts the PROCFILE parameter is always effectively "*", meaning that the group members are searched sequentially for a procedure during INCLUDE, DISPLAY etc. processing. Any UPDTFILE specified is ignored by procedure-handling commands.
The decision to handle groups in this way has one or two minor complications. Firstly it is not possible to have a single group which contains a mixture of pure data files and pure procedure files, since it will not open completely either way. Secondly, since PROCFILE is always effectively "*", the PROCEDURE command and editor/SAVE do not work in group context.
For interest, the temporary procedures are stored in the "#USERTMP" directory, which is created at system start up, and deleted at shut down time. Each user's temporary procedures are stored in sub-directories of that, and are deleted when the user logs off. You should not store other files in there though, or the system may not be able to clear it down properly. For this reason the ALLOCATE command will disallow any attempt to allocate this directory or any files/directories inside it.
One of the front end disconnect options mentioned above makes it so that temporary procedures need not always be lost when a user thread ends (e.g. you accidentally cause a serious error which bumps you off and you lose half an hour's work in proc zero).
Semicolons in procedures
On Model 204 a semicolon in a procedure is indistinguishable from end-of-line, since the physical storage format in table D uses semicolons to represent line ends. (NB. ignoring here the LINEEND parameter, which does not exist on DPT). Many people make use of the behaviour of semicolons on a day-to-day-basis, for example to enter ad-hoc requests:
b;fpc;end
equals
b
fpc
end
or to force a line split in the editor, which happens in a much more visual way on DPT.
At the DPT command line, semicolons used as in the above example have the same effect, namely to "queue up" multiple logical input lines. Procedures on DPT (just text files remember) can contain semicolon characters, but when those procedures are included the semicolons have the same effect as they do on Model 204. However, the DPT Editor takes steps to avoid this situation if possible.
Other issues with procedures
Procedures can contain lines longer than STRINGMAX (default = 255) characters. The system issues a warning message when such a procedure is saved, and $RDPROC truncates the long lines when it retrieves them.
DPT provides a slightly simplified implementation of M204's large object facility (CLOBs/BLOBs/Table E). In some details the behaviour of DPT differs from Model 204, but it's not that different. The following notes clarify the similarities and differences between the two implementations. Further details can be added if anybody's interested.
The "BLOB" field attribute also implies STRING, and cannot be an indexed field. DPT does not support a "CLOB" variation like Model 204's, where the system converts to/from ASCII/EBCDIC at any stage - everything is ASCII. You can of course explicitly convert using $E2A/$A2E during extracts/loads, or use the equivalent option on DPT fast unload/load if they are involved anywhere. On a similar subject, using $LZW with BLOB data prior to storage might save some disk IO with large text fields on cheap disks, although it might or might not be more efficient overall depending on field access patterns.
BLOBs can be up to 2GB each in size.
BLOB fields on DPT can be used as if they were normal STRING fields in nearly all respects. The following points give some additional information.
User Language:
The DPT parameter STRINGMAX should be set to a high enough value in order to run User Language programs that access BLOB fields; otherwise the long values will get truncated as they're manipulated by the UL runtime. It will also not be possible to define sufficiently large string variables to hold them.
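As a rough sketch (the parameter value, field names and variable are all invented for illustration):
RESET STRINGMAX 100000
BEGIN
  %DOC IS STRING LEN 100000
  RECS: FIND ALL RECORDS FOR WHICH DOC_ID = 'X42'
  FOR EACH RECORD IN RECS
    %DOC = DOCUMENT_TEXT
  END FOR
END
Here DOCUMENT_TEXT is assumed to be a BLOB field; with STRINGMAX left at its default of 255 the LEN 100000 declaration would not be possible and the value would be truncated.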
The standard STORE, ADD, INSERT, CHANGE and DELETE statements can deal with BLOB fields on DPT as if they were regular STRING fields. The M204 statement extensions dealing with table E reserve space and M204's "universal buffer" are not supported, since the implementation is not as complicated as that (see below). The extensions to the PAI and PAI INTO statements are however supported.
$Functions: $LOBLEN is identical to $LEN. $LOBRESERVE is not supported. $LOBDESC is a custom DPT function.
As previously mentioned, BLOB fields may not be indexed. It is however possible to perform FIND statements on them, which are considered to be "table B searches" and are controlled by the same parameters (MBSCAN) and other considerations, although of course table E disk reads would also be required. For a similar reason, changing or deleting occurrences of multiply-occurring BLOB fields is much more efficiently done by specifying the occurrence number rather than the old value.
The User Language pattern matcher will do its brave little best with BLOB data, but such operations won't necessarily be quick. Only the simplest of patterns are likely to be usable in certain situations, such as database finds against lots of very large objects.
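For instance (field name invented), a find like the following is legal, but is costed as a table B scan subject to MBSCAN, with table E reads on top for every candidate record:
BEGIN
  HITS: FIND ALL RECORDS FOR WHICH DOCUMENT_TEXT IS LIKE '*OVERDUE*'
  CT: COUNT RECORDS IN HITS
  PRINT COUNT IN CT
END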
Commands:
DEFINE FIELD, REDEFINE FIELD and DELETE FIELD work as normal. REDEFINE reclaims space from table E if the field becomes NBLOB. DELETE reclaims space from both table B and table E.
The TABLEB command and its flavours obviously only consider the descriptor value when analyzing table B pages. The TABLEE command is not supported (but see below for some other sources of information), and there is also no COMPACTE command.
=UNLOAD and =LOAD can handle BLOB fields although they may require a slight format variation if "PAI" mode is used.
Large object data is stored on pages in the "heap" file area, that is, the part that isn't table B. In other words DPT takes the same approach with "table E" as it does with "table A", which is that it isn't a separate file area but part of table D, along with everything else except the data. The terms "table E" and "table A" are however still useful, to mean the data substructures within table D which manage field descriptions and BLOB data respectively.
For BLOB field occurrences the table B record contains a 10 byte string value composed from binary integer representations of the primary table E page number (4), the slot number on the page (2), and the BLOB's full size (4).
Table E pages have the same layout as table B pages (i.e. a slot pointer area and a main data area). Like table B records, BLOBs can also have extensions making them span more than one page. Note therefore that this is different from M204 where each non-null BLOB field takes an entire table E page, even if it only uses a small part of it. DPT can therefore be thought of as more willing to efficiently handle BNSLOBs (binary not-so-large objects). There is no "ERECPPG" parameter controlling the density of BLOBs on table E pages. Or rather there is an effective parameter hardcoded at 32, meaning < 1% wastage for the slot area if BLOBS are large (more than one page each), and giving efficient packing down to BLOB sizes of around 256. In the special case of zero-length BLOBs, neither a slot nor any table E page space is used, although table B still holds the descriptor. The main drawback with the flexibility of this scheme is that if many BLOBs are in the size range 4K-8K some of them will end up spanning two extents instead of one. You win some you lose some - ERECPPG can be introduced if anybody wants it.
Since the layout of table E pages is very similar to that of table B pages, the same issues relating to item expansion, reserve space, and page reuse might be thought to apply. However, since typical BLOBs occupy a significant part of a whole page or many pages, and are always entirely deleted and rewritten during amendment, there would be no benefit in having resettable "ERESERVE" or "EREUSE" parameters, and the corresponding page use and reuse processing is hardcoded. (The trigger point is half a page).
The currently-active BLOB page is indicated for reference by the value of the EACTIVE file parameter, similar to DACTIVE and ILACTIVE. This is different from M204 where table E is a contiguous area, and has a currently-active page shown by EHIGHPG, analogous to BHIGHPG.
The "table E" pages within table D are not maintained in a contiguous area but are just mixed in with other things like btree nodes and inverted list bitmaps, as each is allocated. Also, DPT makes no specific attempt to keep BLOBs for the same record, or extents of the same BLOB, together, although in general usage both will often apply, and will always apply after a reorg. When all of table D has been used once and starts getting shuffled via the reuse queue, a reorg may improve BLOB access times if they're critical. Keeping BLOBs in their own file or group member with mininal indexing might also reduce such conflicts of interest in table D.
Diagnostics
In addition to the things mentioned previously, DPT also provides the following sources of data structure information:
Printing large BLOB values to various destinations is handled as follows.
Manipulating BLOB fields increments the same statistics as normal STRING or FLOAT fields - BADD, BCHG, etc. No separate stats are (currently) maintained when table E values are added or removed, although you will see DKPR/DKWR etc. registering the extra page accesses.
Access to table E is protected by the DIRECT CFR: it is considered part of the record data.
During transaction backout of User Language delete operations, reinstated BLOBs will almost certainly not have their previous locations in table E. In other words, the descriptor reinstated onto the table B record will be different from before.
Within the DPT host, file sharing between users is controlled using a shared/exclusive scheme based on the type of access required, and this works pretty much as you would expect. For example if one user opens a sequential file for image writing, they need an exclusive lock. Including a procedure requires just a share lock on its .txt file, and so on.
Within a server application such as DPT, structure locks can be implemented in a number of ways of varying sophistication, and a lock which maintains a user id, the time that the lock was placed, and other associated information like who's waiting to get the lock next, is more of an overhead on the system than one which simply grants or denies access with no explanations given. Therefore the system generally only goes to the extra trouble where it seems worth it, i.e. when information about conflicts might be of benefit for diagnostic purposes. On the other hand this "interest" factor has to be balanced against the desire to ensure that the work involved in setting up the lock is small compared to the amount of work done whilst under its protection.
The vast majority of internal locks are implemented in the "no frills" manner. Exceptions are made for things such as procedures, groups, and a small set of file structure locks for each file, analogous to the famous Model 204 "critical file resources" or CFRs. Because these locks are so well known, DPT uses corresponding names (DIRECT, EXISTS etc.) although it should be noted that this does not mean the locking behaviour is exactly the same as on M204. All such "higher" locks show up on the output from the custom MONITOR RESOURCE command.
Record locking by and large works as per M204 record locking. This includes the necessity for an exclusive lock on records during all update statements (which may or may not then be released at END FOR depending on the TBO/LPU setting), and the necessity for a share lock on all (usually) records in the file at the start of finds containing table B searches. It also includes the system-generated "hidden" share lock used to guarantee a record's integrity during PAI.
DPT will also on occasion report "sick record" if you work with unlocked records, especially but not exclusively using occurrence processing.
"Procedure files" (i.e. OS directories) are not enqueued at all. When accessing an individual procedure this is not a problem, because the OS enqueue on the .txt file will prevent the directory from being deleted. However, after "opening" a procedure directory, there is no guarantee that the directory will not be deleted outside of DPT before it is actually used.
On M204, if SEQOPT is 1, whenever a user thread requests a page, the disk buffer system retrieves and returns the page to the user, and then immediately initiates another asynchronous retrieval for the next page. The plan is that when the user thread has finished with page A, page B is ready and waiting, or at least closer to being ready than it would otherwise have been. On DPT, all disk reads are performed synchronously by user threads (but don't block other threads - this is an implication of using OS threading facilities). Requesting the next page at the same time as the current page is however still worthwhile in a different way, since retrieving two pages at the same time from disk is usually faster than two separate disk reads, especially if the disk head would have been moved elsewhere on the disk in between the two separate reads.
In fact the benefit of this applies not just to double-size chunks but to larger multiples as well, so on DPT SEQOPT can go up to 255. In other words if SEQOPT is 7 each physical disk read (DKRD) will actually read in 8 pages, and in ideal conditions DKRD will show 1/8 of what it would have done. During testing of DPT, increasing SEQOPT continued to deliver (admittedly decreasing) benefits for values right up to 255. In practice you would need to experiment to determine an appropriate value for any given situation.
When to use SEQOPT
Interestingly, in a single pass across, say, table B, with no other I/O between reading each page, SEQOPT seems to give no benefit. This is probably because the disk head is always positioned at the right place as each new page is required, and the small physical overheads of transferring the page data seem to balance out the small DPT internal administrative overhead of managing SEQOPT. However, in a typical program there will usually be some non-trivial processing that happens inside the loop that's driving the I/O, and causes the disk head to be moved. On a personal workstation pretty much everything will be on the same disk, so accessing any other file (assuming DKRD is required) or even just writing a line to the audit trail can cause disk head movement, and SEQOPT should help.
SEQOPT can also help with index processing such as b-tree walks and searches, but only under conditions where there is good localization of logically adjacent b-tree leaves, and/or inverted list pages for each value. For example if a large file has been loaded using deferred index updates, these conditions would be met and SEQOPT would be worth a look.
SEQOPT is a file parameter, which means in multi-file processing it might be worth tweaking for some files and not others, or to different degrees for different files. It can be reset during a run, the new value taking effect for all physical reads on the file from that point on.
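For example (illustrative value, assuming the usual reset mechanics for the current file):
OPEN BIGFILE
RESET SEQOPT 7 //each physical read (DKRD) now fetches 8 consecutive pages of this file
RESET SEQOPT 0 //afterwards, restore single-page reads for normal random access
The sequential pass itself would run between the two RESETs.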
Various space-reclamation processes occur in the heap area, containing as it does many different data structures. These processes are mostly under automatic control.
You can see the current extent structure of a record for interest using PRINT *RECINFO in a record loop. This is a custom variation of M204's PRINT *RECORD.
The sort utility must therefore be able to handle variable length records that are CRLF-delimited. Because of the position of the field ID and value you can define them as a single key starting at position 6 and reaching to the end of the record.
Generated in response to the NOCRLF option on the OPEN command. This format, introduced in V2.10, allows field values which might contain CR or LF characters within them - for example purely binary data. These values would otherwise cause the load to fail.
Using this format demands that the sort package understand records with the unconventional "length byte", or that you write your own sort code.
This is the same as format A1 above, with the numeric value expressed as an ASCII string. Records are written in this format instead of the default N3 (see below) if an appropriate option is given on the OPEN command. The value of N can be specified on the command or left to vary.
Generated as an alternative to N1 if the NOCRLF option was given on the OPEN command - same idea as with A1 vs A2. Probably less useful than A2 but included for consistency.
All 3 values in this record format are binary, not strings. However, since the field IDs are only being sorted to divide up the file, and their actual value is unimportant, it does not matter how that two-byte portion is interpreted. Getting the sort to treat it as a two-character string will most likely be just as efficient as a 2 byte integer. Following on from that, bear in mind that IEEE floating point values cannot be sorted as character strings, except under certain special conditions. You should consult technical documentation if in doubt, but this is mentioned because a common case, where values are all positive integers or zero, would satisfy those conditions and allow an efficient sort using a single "string" key of {5,10}. Otherwise, the sort utility must be able to interpret floating point values properly, and must be given two keys: {5,2} and FP{7,8}.
A common usage of the fast load feature might be to extract data from Model 204 and load it into DPT. DEMOPROC in the DPT demo download contains a simple User Language program which creates acceptable extract files (in the simple "PAI" format). With more effort you could develop extracts in the lower-level formats, which would create smaller extract files (in the TAPEI case much smaller) and would be faster to load into DPT.
Each of the sequential files can contain any amount of header/comment text, typically CRLF-terminated lines, which must be at the very start of the file. The comment block must start with a CRLF-terminated line of 20 or more asterisks, and is considered to end at the next line of 20 or more asterisks. This section is where to specify any non-default format/encoding options that were used to create the data in the file, for example as covered in the =UNLOAD command notes. In other words these files are self-describing. The comments section is optional - all files can just start straight away with the main data if desired.
To specify options, use the appropriate keyword prefixed with a "+" (activate) or "-" (deactivate), anywhere in the header area. Since all options are off by default, only "+" will actually have any effect. For example the =UNLOAD command writes a header block something like the following in all the files it creates:
******************************************************************
* DPT fast unload file generated on 1st January 2010 at 12:00:00 *
* File SALES, index for field ACCOUNT ID                         *
* Format options: -FNAMES -NOFLOAT +ENDIAN +EBCDIC +CRLF -PAI    *
******************************************************************
A fast load starts off assuming input files contain ASCII, but the lines-of-asterisks and the option keywords are recognized in EBCDIC too (according to the translation table currently in effect). So a file generated on the mainframe containing all EBCDIC can still have its EBCDIC data processed correctly, since the presence of "+EBCDIC" in the header block will activate translation for the actual data.
During a reorg DPT doesn't bother converting the field codes on table B pages into literal field names and back, and this makes for faster processing and smaller intermediate files. When feeding custom data into a fast load, you can make use of this feature too and it will give the same speed/file size benefits, at the cost of slightly more complex set-up. Generally however it's clearer to use field names when transferring data between systems. This is controlled by the FNAMES option.
All numeric values when expressed in binary form, such as record numbers, value lengths, field codes etc. are assumed to be *unsigned* binary values, since in many cases the full positive range is required for correct operation.
Transferring data to/from Model 204 raises several issues about the encoding of data:
In its simplest form this file should contain standard DEFINE commands, which may be continued across lines with hyphens. E.g.:
DEFINE FIELD SURNAME (STRING ORD CHAR)
DEFINE FIELD CUSTOMER_ID (WITH ORD CHAR SPLT 99)
This format was chosen as easy to generate on Model 204 by simply issuing the "D FIELD (DDL)" command, and easy for DPT to handle by just treating it like incoming DEFINE commands. Unsupported M204 field attributes like KEY and FRV are ignored and will not break the load.
A slight variation is when using field codes in the TAPED file (see next section), in which case give them here in TAPEF immediately before the names, as decimal strings or X'FF' style hex strings.
DEFINE FIELD 1 SURNAME (STRING ORD CHAR)
DEFINE FIELD X'0002' CUSTOMER_ID WITH ORD CHAR SPLT 99
The information in this file describes either the database records that were unloaded or the ones to be loaded. The default layout is shown first, followed by a plain text "PAI" style variation which is sometimes more convenient to work with.
The indentation above is just for readability - the actual file contains no tab or space indentation.
Record numbers:
During a load the record numbers in the input are not preserved. The newly-stored records will have numbers as per the current BRECPPG for the file and whether any records were present when the load started. The input record numbers only have relevance if a TAPEI file is to follow, in which case they are essential and correspond to the record numbers contained in the inverted lists/bitmaps. In the file produced by a fast unload, the record numbers are those of the primary extents, and may or may not be of interest depending on what you're doing with the extract.
Note that fast load is purely a record-storing process. Supplying record numbers in TAPED will not make it find and amend an existing record. User Language must be used to do that kind of thing if required.
Field codes:
If you use field codes (instead of names) any codes are OK so long as they're unique, in the range 0-4000, and match the ones in TAPEF mentioned earlier. DPT will be allocating fresh field codes as part of the load anyway. When in doubt, go with the field names option which makes things clearer if slightly slower.
The order of the field/value pairs on each record is preserved when loading.
The end of the record is denoted by x'ffff', since that sequence is invalid to start the next FV pair in either format. You can also use the "CRLF" =UNLOAD option to specify an extra CRLF byte pair at the end of each record. This can be handy if fast unload is being used to create a general extract with the intention of passing it on to other utilities like sorts and so on. It can also make it easier to create custom files to pass into DPT fast load.
In general the above "compressed" TAPED format is exactly as data is stored in table B, making for highly efficient operation in e.g. file reorgs.
BEGIN
  FOR EACH RECORD
    PRINT $CURREC
    PAI
    PRINT
  END FOR
END
This kind of output is quick and easy to generate on Model 204, and is readable, with no binary data items. The lack of binary data can also make it easier to FTP the files around with less fiddling of settings like for EBCDIC translation. On the downside this format is less efficient going into and out of DPT, requiring more reformatting and rearranging to fit the internal data structures.
PAI format consists of CRLF-delimited text lines (usually, see below). Field/value pairs are on the same line, separated by the 3 characters " = ". Records are separated by a blank line.
The only other format option relevant to this mode is EBCDIC. If it is given, the text, spaces, equals signs and decimal digits are converted by DPT to EBCDIC (unload) or from EBCDIC (load). CRLFs are not data as such, and are always X'0D0A', as per the earlier comment.
PAI mode for fields containing newline characters
This variation is required since newline characters in field data will mess up the CRLF-delimited format. It is primarily intended for BLOB fields but can be used with any field. As with Model 204 the variant format is produced by the PAI statement LOB_FLOD option. For example a BLOB field called MYFIELD, containing a BLOB 10 bytes long, which is all "X" characters, looks like this in regular PAI and alternate PAI format:
MYFIELD = XXXXXXXXXX
MYFIELD =10=XXXXXXXXXX
In other words the length is enclosed in two equals signs, with no spaces except the one after the field name. This causes DPT to read the field value as a specific number of bytes (10 in this case) instead of searching for newline delimiter characters. There should however also be a CRLF sequence after the data bytes.
Each TAPEI file contains index entries for a single field. The default layout is shown first, followed by a plain text "PAI" style variation which can be more convenient to work with.
TAPEI files are optional in both fast unload and fast load. If they are not present for some or all fields during a load, DPT will generate index entries from the field=value pairs in TAPED. If loading data from an external system it may be more efficient to do that anyway, avoiding the time taken to prepare and transfer the index information to the DPT machine.
The following information is repeated for each value of the field:
Values:
Numeric values may be in string format if the NOFLOAT option is used. Values should ideally appear in the TAPEI file in the order they will end up in the final b-tree. A load will work if some or even all values are out of order, but more slowly and with less satisfactory final results in terms of b-tree structure.
Record numbers:
The numbers in this file correspond to those in the TAPED file if both are being processed in the same load. Therefore depending on the circumstances, DPT may or may not need to adjust them, for example to take account of TAPED input records receiving new numbers because of reclaimed record slots, BRECPPG changes, etc.
File pages are 8K on DPT, unlike M204's 6K. So in the above layout, the term "segment" means each group of 65280 consecutive record numbers, and bitmap-style inverted lists are 8160 bytes in size. A record's file-relative record number (Rf), as you would print with $CURREC, is known from its segment number (S) and its segment-relative record number (Rs), by taking Rf = (S x 65280) + Rs. The first segment is segment zero; the first record is record zero.
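As a worked example, file-relative record number 130600 lies in segment 2 (2 x 65280 = 130560), with segment-relative number 40; conversely, a TAPEI entry for segment 1, record 100 refers to file-relative record (1 x 65280) + 100 = 65380.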
Terminators:
The optional CRLF sequence after each segment entry is controlled by the "CRLF" option. This option is not for readability purposes but to make it easier to manage extract files which might otherwise contain records of many Mbytes each. For example it becomes more straightforward to extract index data from Model 204 in a User Language image, which can only map 32K of data.
The optional 4 byte value terminator should be used in cases where the eventual number of distinct segments that will contain a value is not known when starting to create the TAPEI data for that value. In that case use a large number (x'ffff') for the segment count, and the load will move on to the next value when it hits the terminator, rather than after processing a set number of segments.
Like TAPED, the above "compressed" TAPEI layout is similar to how information is held within a DPT file, meaning that in many situations minimal conversion is required during unload and reload, and things can be done with efficient page-level disk operations.
A program like this is quite resource-hungry, so depending on the relative power of the machine running it and the target DPT machine, it may or may not be more efficient to let DPT build the indexes again from the base data in TAPED, forgetting about TAPEIs altogether. With certain invisible fields that may be necessary anyway.
BEGIN
  V: FOR EACH VALUE OF MYFIELD
    PRINT VALUE IN V
    FOR EACH RECORD WHERE MYFIELD = VALUE IN V
      PRINT $CURREC
    END FOR
    PRINT
  END FOR
END
During a load, the above TAPEI layouts are valid either way regardless of the number of records, and the load processing will promote/demote as required before storing in the database. However, the format chosen will have a big effect on the TAPEI file size. Imagine a segment where every record possessed a particular value. That would require an inverted list of 65280 2-byte entries, or over 130K in "array" form, compared to 8K in bitmap form (a factor of x16 increase). Furthermore, in "PAI" style each inverted list entry (plus CRLF) requires say 6-10 bytes in a typical file, so that's another factor of x4 or x5. So if you're actually using TAPEI and the size of the file is a problem, it may be worth going to the trouble of generating loadable data in the more compressed formats.