This blog looks at how to analyse additional data in MERLIN that isn’t in the main survey data file – a typical example being codes created from open-ended questions and entered in a secondary file together with serial numbers that match the main data file.
Merlinco software offers many different ways of dealing with this situation, and our object is to discuss the pros and cons of the main approaches so you can select the most suitable. We won’t be describing each method in detail, but will point you to the relevant features in the Help files.
IF statements – the worst method!
A common approach is to edit the secondary file to produce a series of IF statements which are then incorporated into the MERLIN script, specifying the action required for each serial number, e.g.
IF $IOBS/1, D $91-93=546,
IF $IOBS/2, D $91-93=701,
IF $IOBS/3, D $91-93=192,
…
This method is extremely inefficient and, if there are thousands of serial numbers involved, can create huge script files and cause limits to be exceeded. It is worth noting that this approach would be far more efficient (and run faster) if the code was changed as follows…
IF $IOBS/1, THEN, D $91-93=546,
ELSEIF $IOBS/2, THEN, D $91-93=701,
ELSEIF $IOBS/3, THEN, D $91-93=192,
…
ENDIF,
… but it would be even better not to use this approach at all!
Joining files with MERLIN Toolkit
On the Tools menu in MERLIN Toolkit, there is an option to ‘Join data files horizontally’, which enables you to match records in the main and secondary files, and append the latter to the former.
This approach is quick and easy to use, and means you can then forget the secondary file and simply define extra variables from data locations at the end of the record. It doesn’t matter whether or not secondary data exists for every record in the main file – and the program can also cope with any secondary data records not matching the main file – but there are two disadvantages:
a) You have no control over which secondary file data items are added, or where they go in the main file – the entire secondary file record will always be appended to the main data record. One consequence of this is that the serial number will be in two places in the record – which doesn’t matter in itself but, if the secondary records are very long and you only want some of the data, it is generally inefficient.
b) You may only use ASCII or CSV data files, and both files must be the same type – so you can’t append data from a CSV file into an ASCII one, and cannot use other file formats at all (although you could save Excel files in CSV format of course).
Using two data streams in MERLIN
MERLIN allows you to define two input streams to be read side-by-side – so you can use INPUT(A) for the main data and INPUT(B) for the secondary data. In effect, this does the same as MERLIN Toolkit Horizontal Join but within MERLIN analysis run – although you can, if you wish, output a new data file (containing whichever data you wish in any locations you wish) to be used in subsequent analysis runs. There are, however, several disadvantages:
a) As with Horizontal join, the two streams must both contain ‘freefield’ or ‘fixed location’ data files – although, unlike Horizontal Join, you can use Excel or card image files.
b) The files in both streams must be in serial number order (or whatever the matching key is).
c) Writing the MERLIN script to use two input streams is fairly simple if both streams contain the same serial numbers but, if there are any mismatches, it becomes more difficult – especially if either stream can contain serial numbers which are not in the other!
In most cases, MERLIN Toolkit Horizontal Join is preferable to this method.
Reading secondary data alongside MERLIN IDF
A variation of the above approach is to read the secondary data file alongside an IDF (MERLIN’s Internal Data File) previously created from the main data file. This is done by specifying both ‘IDF=’ and ‘INPUT=’ in the MERLIN Control Stage.
This saves re-reading the main data file, and allows you to use any type of secondary data file. Other than this, however, it has the same disadvantages as using two data streams – plus the fact that there is no way of sorting an IDF so, if it isn’t in serial number order, you can’t use this method.
The MERLIN %LOOKUP and %GETLKFD functions
This is the most flexible approach.
Up to 124 LOOKUP files may be defined in the MERLIN Control Stage, and they may be CSV, tab-delimited, or Excel files – irrespective of the format of the main data (which could be an IDF). Neither the main data file nor the LOOKUP file needs to be sorted in any particular order.
Using the %LOOKUP function you may read whichever fields you wish into a variable or a raw data location. For example, the following statement searches LOOKUP file (A) for the record where field 1 matches the serial number ($IOBS) in the main data file, then puts the contents of field 10 into the variable $IVAR.
DI $IVAR = %LOOKUP(A,$*10,$*1=$IOBS),
You may refer to LOOKUP file fields by number or letter or (if a header row is present), by name. You may use any of the six relational operators when testing for a match, e.g.
DC $CVAR = %LOOKUP(A,$*’area’,$*’ref2′.GE.$REF1),
Assuming you have your secondary data in a ‘freefield’ file (or can convert it into one), the only significant disadvantage of %LOOKUP is that it can be somewhat slow if very large data files are involved – but, if this is an issue, you can output a data file containing the secondary data to save using %LOOKUP in each run.
Finally, if the matching criteria are more complex, e.g. there may be more than one matching record and/or the match depends on more than one variable, the %GETLKFD function gives you even more flexibility, and allows you to write your own script for reading the LOOKUP file.
Any questions? Email support@merlinco.co.uk.