Every day MERLIN is used to process extremely large datasets. For example, one client used it to analyse 130 million data records – and there is in fact no practical limit to the number of records that can be read, or to the number of data files in which they are contained. Some of the limits that MERLIN does have are so large that we have never known anyone to break them – for example, the physical length of ASCII or CSV data records can be anything up to 20 million bytes.
Nevertheless, the size of projects is continually growing and it is inevitable that MERLIN users do sometimes encounter limitations – but this blog is designed to show you how to work around some of the most common ones and, if you still have problems, see the final section below about ad-hoc solutions.
Too many fields in CSV/XLSX data records (solution 1)
MERLIN can process the first 9999 fields of a CSV data file (or columns of Excel data).
The most common reason for files having more fields is that multi-coded questions are stored in “bitstring” rather than “spreadfield” format. Bitstring format uses one field for each response, whereas spreadfield format uses just one field for the whole question – so if, for example, Q10 has six responses of which the 1st, 3rd and 4th are true, it would be stored like this:
Bitstring format 1,0,1,1,0,0 (in six separate fields)
Spreadfield format “1,3,4” (in a single field)
Note that in Excel files, it is unnecessary to include the surrounding quotes, so the spreadfield cell would contain simply 1,3,4.
The spreadfield format is clearly preferable – not just because it uses less space, but it is also easier to process: the “header” in line 1 of the file can identify the single field as “Q10” (for example) as opposed to “Q10_1, Q10_2, etc”, and the MERLIN variable definition will be far simpler.
The MERLIN Toolkit “tools” menu contains a utility to “convert CSV/XLSX bitstrings to spreadfields” (see blog #12) and this will often result in a data file that doesn’t exceed the MERLIN limit.
Too many fields in CSV/XLSX data records (solution 2)
If, having tried the above, the file still has too many fields, the next step is to split it, e.g. into fields 1-1500 and 1501-2000. This can easily be done in Excel by selecting and cutting fields 1501-2000 and pasting them into a new file (or maybe a new worksheet within the same file, since MERLIN allows you to specify which worksheet you wish to read). To display column numbers in Excel, rather than letters, select “Options > Formulas > R1C1 reference style”. Note that there is no need to include any questions in both files (not even the serial number, although it is good practice to do this), since they will both contain the same records in the same order, ready for the next step…
Create one MERLIN script to read fields 1-1500 into an “idf” (Internal Data File) and another one to do the same with fields 1501-2000 – but the second script should also read the “idf” created by the first one, using RCP (Run Control Parameter) “IDF=SCRIPT1” as well as “INPUT=FILE2.XLSX”. This causes the two files to be read in parallel and you will end up with a single “idf” that contains all the data. If it is necessary to split the original data file into more than two parts, the same principle applies, and each MERLIN script will add more variables into the “idf” created by the previous one.
The RCP “USEHEAD” allows you to reference fields by the name in the file’s header, which is much easier than using the field numbers which may change.
Too many variables
MERLIN can store up to 32000 variables in an “idf”.
Sometimes this limit is exceeded by users creating permanent variables in the “idf” that could be temporary variables. Temporary variables start with the letter “X” and are not permanently stored so, if you are creating one variable only as a step towards creating another, you should make the first one temporary. As well as saving space in the “idf”, it means that if you export the project to another package such as SPSS, the database won’t contain variables that aren’t needed by the end-user.
Note that you can “re-use” the same temporary variable as often as you like, providing its type and size is unchanged. If, for example, you are doing many complex arithmetic calculations that involve the creation of a temporary integer variable, it is more efficient to re-use the same variable name each time.
It should be noted, however, that although temporary variables are not permanently stored in the “idf”, they are kept there until the end of the current data stage. If, therefore, you have more than 32000 variables in a data stage, you may be able to overcome this by (a) changing as many of them as possible into temporary variables and (b) moving variables 32001+ into a second data stage within the same script. In this way, the space used by temporary variables in the first data stage will become available for permanent variables in the second one.
If a project has more than 32000 permanent variables, it can be analysed as two separate projects, but with key variables such as serial number and demographics in both. You can then join the two sets of tables into a single file, as described in the next section below.
Too many tables
A single MERLIN run can output up to 32000 tables. If you require more than this, you can generate separate files in different MERLIN runs, then join them together. If you are using automatic table numbering, you can set the starting number for each set of tables using format STN.
If your tables are in Excel, you can join them into a single file using the MERLIN Toolkit “tools” menu utility to “join Excel tables vertically”. To ensure the table numbers are unique they will be renumbered but, if you know this is already the case, you can switch this option off.
Too many columns in Excel tables
MERLIN tables may each have up to 32000 rows and 1500 columns. In practice, the number of rows isn’t really limited as MERLIN tables can be joined vertically (using +T syntax) into a single output table – but the columns limit can sometimes be a problem.
Let us suppose, for example, that every table is to be analysed by 2000 sales areas. To do this, run the same script twice, first analysing by areas 1-1000 and then by areas 1001-2000. The two sets of tables can then be joined using the MERLIN Toolkit “tools” menu utility to “join Excel tables horizontally”.
Stage too large
This section is relevant if MERLIN displays an error message saying that the script is too big to compile or execute, or that too much space is required for tables, which are all different symptoms of the current stage being too big for MERLIN to handle. This doesn’t relate to a specific limit such as the number of statements, but to the overall storage space required.
The error message may suggest that you increase the setting of RCP MEMORY which has a default setting of 1 but may be increased to 2, 3 or 4. You are advised to increase this value one step at a time because setting it to a higher value than needed may make your MERLIN script, and other processes on your PC, run slower.
Another solution is to split the stage, e.g. if the error occurs in a data stage, first move all the tables into a tables stage. If the data stage is still too large, create two consecutive data stages – and similarly for tables stages. The most common cause of table stages being too large is that the number of rows assigned to frequency tables is unnecessarily large, e.g. you have specified T#20 = $ivar(9999) * $banner. The number used here should never be greater than the number of respondents in your data and should ideally be set much lower; if the number is too small, MERLIN will inform you this is the case and you an increase it. If you use format TNP, this can save making the same change in many individual TABLE statements.
A single MERLIN run can have a large number of table stages, and this makes no difference to the output, but the more stages you have, the slower the run will become. A large number of stages is usually (but not always) a sign of inefficient coding, such as using lots of IF statements to code open-ended data, rather than using %LOOKUP (see blog #6).
Adhoc solutions
If you have a project that exceeds MERLIN limits (especially if it is a repeat project), it may be worth speaking to us about it. As well as advising on solutions within MERLIN, we may be able to solve the problem by adhoc programming outside of MERLIN – something we have done for a number of clients.
Any questions? Email support@merlinco.co.uk