MERLIN Toolkit is supplied free with MERLIN and MERLINPLUS, but can also be purchased on its own by non-MERLIN users.
Its main function is to provide a user-friendly bridge between MERLIN and other software packages so – for example – a MERLIN user can import from a web interviewing package (via Triple-S), or export to SPSS. Someone not using MERLIN, however, can get from the web interviewing package to SPSS without doing any MERLIN processing in between, or convert from one Triple-S version to another. For a full list of MERLIN Toolkit imports and exports, refer to the Survey Software section of our website.
This blog, however, concentrates on the lesser known Tools menu, which contains a number of useful utilities.
Join data files vertically / horizontally
These utilities combine separate data files into one, and can be used with ASCII, CSV or “cba” (column binary) data.
Vertical Join simply places all the records from separate files into a single file – whether for convenience, or because you are using software that requires this.
Horizontal Join is used to combine records that belong to the same respondent, but are held in separate files with matching keys – e.g. pre-coded and open-ended data with matching serial numbers can be combined into a single file (typically by adding the open-ended data on the end of the pre-coded records). The user can choose whether or not to include unmatched records from either file.
Join Excel tables files vertically / horizontally
These utilities combine separate Excel files of MERLIN tables into a single volume – the original files having been generated with MERLIN Run Control Parameter XLSTAB set to 1 or 3. The files may be separate because they were generated at different times, or to overcome size limitations, or simply because it was more convenient – but you now wish to put all the tables in a single file
Vertical Join puts all the tables into a single volume without modification (except for renumbering them if required), and generates a new Table of Contents.
Horizontal Join is used with sets of tables that are identical apart from the banner (horizontal axis), and combines these separate banners into one.
Convert CSV/XLSX bitstrings to spreadfields
This utility reads a “csv” or “xlsx” data file in which the multi-coded variables are coded in bitstring format (e.g. 1,0,0,1,1 in separate fields), and outputs a new file of the same type in which these variables are in spreadfield format (e.g. “1,4,5” in a single field). This can greatly reduce the size of the file, overcome limits, and make it faster to process.
The utility works by examining the field names in the header row and looking for names containing at least one underscore character “_” and the last underscore (if more than one) is used to split the name into a prefix and a suffix. If consecutive fields have the same prefix, and all the cells in those fields contain 0, 1, space(s) or blank only, then those columns will be merged into one spreadfield with the prefix as the field name.
If, for example, the input file contains consecutive fields named Q5_A, Q5_B, Q5_C, Q5_D all containing 0, 1, space(s) or blank only, they will be replaced by a single field named Q5.
DMERGE
DMERGE is a utility for combining two or more “idf”s (MERLIN internal data files) into a single “idf”, whether or not they are compatible.
DMERGE is mainly used so MERLIN can analyse data from incompatible “idf”s – especially useful for merging data from several waves of a survey. It is also possible to select variables from each of the input “idf”s, so DMERGE could be used with a single input “idf” simply to output a new one with fewer variables.
The output “idf” will contain all the records from all the input “idf”s, and these new records will (by default) contain all the variables defined in any of the input “idf”s. Variables that are not defined in a particular “idf” will be empty.
Any questions? Email support@merlinco.co.uk.
In blog #10, we described how to generate MERLIN tables in “xlsx” files, and how they can be enhanced by creating your own styles file. In this blog we will look in more detail at possible enhancements, using this example (which was generated directly by MERLIN without any post-processing).
For general information about styles files, refer to blog #10, and for more detailed information about the commands, see the MERLIN Help file > User files > “txt” Excel styles file.
A styles file commences with global commands which control things like the width of the row label and data columns, and whether commas are shown in thousands (e.g. 2,385). In our example, we have changed the following global commands from their default settings:
TOCLINK=ON | create hyperlink from each table back to Contents sheet |
TOCLINK COLUMN=B | column to contain above (see cell B1) |
LOGO=MERLINCO.PNG | file containing logo graphic |
LOGO COLUMN=L | column to contain logo in Tables (see cell L1) |
LOGO TOC COLUMN=D | column to contain logo in Contents sheet |
BANNER COLOUR ROTATION=1 | header level at which colour rotation operates (see row 6) |
FREEZE=ON | freeze panes containing table axes (see columns B-C and rows 1-9) |
GRIDLINES=OFF | do not show Excel cell gridlines |
CHART TYPE=HORIZONTAL BAR | generate horizontal bar chart (see rows 23+) |
The global commands are followed by style definitions, which control things such as the font, cell colours and borders, text wrapping or shrinking. The style definitions form a hierarchy as shown in the following chart (information in brackets relates to our example):
So, BASALL defines the style for the entire table but this can be overwritten by BASTOP for titles at the top, then by the individual components in that section such as LBJ (lines before job header) and JH (job header). See MERLIN Help file for a full list of the individual components.
Each style definition starts with its name in square brackets, e.g. [BASALL], followed by style definition commands.
The following commands may be used within each style definition (but it is only necessary to specify them if non-default):
FONT FACE=fontname |
FONT SIZE=n |
FONT COLOUR=colour |
FONT BOLD=OFF / ON |
FONT ITALIC=OFF / ON |
FONT UNDERLINE=OFF / ON |
TEXT SHRINK=OFF / ON |
TEXT WRAP=OFF / ON |
CELL BORDER STYLE=NONE / DASHED / DOTTED / THICK / MEDIUM / THIN |
CELL BORDER COLOUR=colour |
CELL COLOUR=colour / colour / colour … |
CELL COLOUR ALTERNATIVE= colour / colour / colour … |
JUSTIFICATION=CENTRE |
FIGURES=LABEL |
The CELL COLOUR ALTERNATIVE command may only be used in BASROW and BASFIG and their individual components.
The JUSTIFICATION=CENTRE command may only be used in BASTLC, BASBAN, BASROW and BASFIG and their individual components. Otherwise, all text is left justified.
The FIGURES=LABEL command may only be used in BASFIG and its individual components, signifying that the style definition is to be copied from the corresponding style in BASROW, e.g. if used in UNR_FIG, the definition will be copied from UNR.
Colours may be specified using three different methods which may be mixed freely:
Hexadecimal and RGB values may be found on the Microsoft Word Custom Colours form.
In our example, we have used FONT COLOUR=BLUE in the VT style to show the Vertical Title in blue (row 4).
The CELL COLOUR command (only) may contain two or more colours (using any of the above methods) separated by forward slashes, e.g. CELL COLOUR=RED / GREEN / CYAN / GREY. Multiple cell colours are only relevant in BASROW, BASBAN and BASFIG and their individual components (they will not actually be used in any BASROW styles, but will be picked up if FIGURES=LABEL is used in BASFIG styles). Multiple colours will be used in rotation across the banner as specified by the global command BANNER COLOUR ROTATION. A setting of 0 means it will change on every column, 1 on every header, 2 on every overheader, etc. In our example, the colour changes on every header (row 6) and this effect is continued for all rows beneath (rows 7-20).
The CELL COLOUR ALTERNATIVE command allows you to use different colours on alternate rows of styles BASROW and BASFIG and their individual components. The alternation works within the individual components, so in our example we have included these commands in the RTV style definition…
CELL COLOUR=#5FEB69 |
CELL COLOUR ALTERNATIVE=#C4F8C8 |
… to produce the two shades of green, and these commands in RTV_FIG…
CELL COLOUR=#71E4F3 / #FFFF69 / #FFB343 / #E99BB9 |
CELL COLOUR ALTERNATIVE=#BDF2F9 / #FFFFCD / #FFDDAB / #F9E3EB |
… to produce the two shades of blue / yellow / brown / purple (used in rotation as defined by BANNER COLOUR ROTATION=1).
We have also used CELL COLOUR commands to specify the colours of each of the blank rows, and the Mean Score row. No ALTERNATIVE is needed, since these rows only occur once within each table.MERLIN formats PSV and SCV
If MERLIN formats PSV (Print Score Values) and SCV (Separate Cell for score Values) are used, the score values will appear in column C, and the figures in columns D+.
When this is the case, column C is treated as belonging to BASTLC and BASROW and their component styles, so it will have all the same characteristics as column B. The global FREEZE command will also be adjusted to include column D.
Only one chart type is currently available (HORIZONTAL BAR). If a table contains any column percentages, and you are using RCP XLSTAB=3 (one table per worksheet), this chart will be generated below the table using the figures from column C (or column D if formats PSV/SCV are used). This will usually be the total column but not always – so you may need to delete or change charts that are inappropriate.
We will add other chart options if there is sufficient client demand.
Any questions? Email support@merlinco.co.uk.
MERLIN can generate tables in “xlsx” files so you can open them in Microsoft Excel – whether you simply want to view them, or maybe create charts. MERLIN’s creation of Excel tables is completely independent of Excel, and may be done without Excel being installed. There are numerous options controlling the output and appearance of these tables, which are introduced in this blog.
Tables and worksheets
Output of tables to Excel is controlled by Run Control Parameter (RCP) XLSTAB, which has three possible settings:
XLSTAB=1 all tables in one Excel worksheet called “Tables”, one after another
XLSTAB=2 each table in a separate Excel file named after the MERLIN table ID
XLSTAB=3 each table in a separate Excel worksheet named after the MERLIN table ID
Settings 1 and 3 also generate a worksheet called “Contents” containing a Table of Contents (TOC) with hyperlinks to the tables themselves. No TOC is generated by setting 2.
All settings automatically suppress output of tables in other formats, except you may specify RCP TABSML if you also want tables in TABSML format.
Plain Excel output
By default, MERLIN produces “plain Excel output” like this:
You will notice that the banner labels (including the header “Class”) are left justified, since that is Excel’s default treatment of text, and zero cells are shown as “0” instead of the usual “-”, since the latter would also be treated as text. This plain output is suitable if you just want to view the figures, or maybe create charts, and the appearance of the tables themselves isn’t important – but it is usually preferable (and very easy) to produce enhanced output.
Enhanced Excel output
By setting RCP XLSID, we can produce “enhanced Excel output” like this:
XLSID causes an “identifier flag” to be placed in column A (hidden by default) which enables MERLIN to format specific rows differently to others. So, by default, a different font is used for the job header, table title and banner labels – and the latter are also boxed and shaded. This is all specified in a “styles” file called “xlstab_std_styles.txt”, which is supplied with the software and used by default – but you can create your own styles files as explained in the next section below.
If you always want to use XLSID with Excel tables, simply set it in your global RCP settings file (MERLIN allows users to define default settings that will be used on every project you run).
Creating your own styles
To create your own style, copy the default styles file from the software installation folder to a folder of your choice, and give it a suitable name, e.g. “my_styles_1.txt”. You can then edit this file, following the explanatory comments to change or insert simple commands relating to the various parts of the table, e.g. there is a line that reads “freeze=off” but, if you change it to “freeze=on”, the Excel “freeze panes” option will be applied so both axes of the table remain visible when you are scrolling. Similarly, you can set “toclink=on” so each table contains a hyperlink back to the TOC.
Cells and fonts can be changed to standard colours, e.g. the banner cells to yellow, the raw numbers font to red, and the vertical percentages font to blue – or you can specify any colour using RGB or hexadecimal values – plus there are many other options, fully described in the MERLIN Help File > USER FILES > “txt” Excel styles file.
To apply your own styles file, use RCP XLSTYLES, e.g.
XLSSTYLES= my_styles_1.txt
Here is a table produced using a styles file, with the changes described above:
MERLIN formats
MERLIN itself has hundreds of formats and label controls affecting tabular output – but not all of these are relevant to Excel tables, and some settings will be ignored. These include all that control page breaks, text spacing, indentation and justification – since these are irrelevant, or can be specified in the styles file. See the MERLIN Help File chapters about FORMATS and LABEL CONTROLS for full details of which settings are applicable or not. There are, however, some formats that specifically relate to Excel tables:
SCV creates an extra column for score values, as opposed to including them in the row labels column
BLC includes blank rows in Excel tables (as specified by the relevant MERLIN formats, or by using * in text) – by default all blank rows are suppressed
BTL causes significance flag rows (e.g. generated by format TTS) to be output when there are no flags – giving the table a more uniform appearance.
There is also a label control <PX> which is used only when outputting tables to Excel, and which forces a banner to be split on that item. Excel tables usually use as many columns and rows as are needed to display the entire table, i.e. there are no “page breaks” – but it may sometimes be desirable to split the banner into two or more sections.
Any questions? Email support@merlinco.co.uk.
In blog #8, we discussed weighting in general, the terminology used, and how to do some types of weighting in MERLIN. In this blog about rim weighting, we will assume you are familiar with the terminology but, if not, we advise you to read blog #8 first.
How rim weighting works
As an example, we will use a project with two rims – gender and age – and the targets shown below.
Just to be clear, we cannot use target weighting because we don’t know the targets for the interlaced matrix, e.g. we don’t know the target for males aged 16-34, only the ‘edges’ or ‘rims’ of the interlaced matrix.
Some people have approached this problem by calculating the factor needed to achieve the gender targets then, having applied this factor to the data, calculated the factor needed to achieve the age targets – then multiplied both factors together. Although this will achieve the age targets, it is unlikely to achieve the gender targets – and if we do the whole process the other way round, the gender targets will be achieved but not the age ones. Rim weighting, however, is a way of achieving two or more sets of targets at the same time.
Unlike target weighting, it is virtually impossible to calculate rim weighting without a computer, since it involves trying lots of different factors until we find the ones that achieve all the targets. These attempts to find suitable factors are known as iterations. Sometimes it simply isn’t possible to achieve all the targets, and MERLIN will report this (but see next section below). Another point to understand about rim weighting is that different programs use different methods of calculation, so may end up with different factors. Although weighted figures would be the same when tabulating gender or age, they may not be the same when tabulating other variables.
Run Control Parameters (RCPs) MAXRIMIT and OKRIM
By default, MERLIN will do 500 iterations before abandoning rim weighting calculations – but this can be increased with MAXRIMIT=n, where n is any number up to 32767. In practice, if MERLIN cannot achieve the targets in 500 attempts, it cannot usually achieve them in more – but it is worth a try. There are two main reasons why rim weighting targets cannot be achieved:
RCP OKRIM specifies that MERLIN will continue with the run even if the targets are not achieved. Sometimes in this situation, inspection of weighted tables shows that the results are close enough or, because of rounding, they may even appear to be what is required.
Calculating weighting for two rims
As with target weighting, the calculation is done with MERLIN’s powerful MANIP facilities.
First, in the Data Stage we need to generate a table (#RIMACT) showing the actual (unweighted) figures for gender by age, then three empty tables to be used in MANIP – a ‘total only’ table for each rim variable (#RIMV01 and #RIMV02), and another table of gender by age (#RIMWGT) that will contain the weighting factors calculated.
F=NPTB, !don’t print these tables
T#RIMACT = $GENDER*$AGE,
T#RIMV01 (F=NITB) = $GENDER*,
T#RIMV02 (F=NITB) = $AGE*,
T#RIMWGT (F=NITB) = $GENDER*$AGE,
Then, in the Manip Stage, we specify the target figures for each rim variable in #RIMV01 and #RIMV02. As with target weighting, it doesn’t matter whether these figures are numbers or percentages (and they could even vary between rims) providing the first figure is the total of the others.
To get the weighting factors into table #RIMWGT, we use function %RIM. The first argument is the total weighted base required, the second is the name of the table containing the actual figures, and the remainder are the names of the tables containing the rim targets. If the total weighted base is required to be the same as the unweighted, specify the first argument as U.
MT#RIMV01 = (100.0,42.0,58.0),
MT#RIMV02 = (100.0,31.0,33.0,36.0),
MT#RIMWGT = %RIM(1500.0,#RIMACT,#RIMV01,#RIMV02),
Finally, in the Tables Stage, we extract the weight from the relevant Row and Column of table #RIMWGT, and apply it to all tables following.
DW $WT=#RIMWGT(R$GENDER,C$AGE)
SELECT WR $WT,
A complete example script can be found in item 11.6 of the MERLIN Tips and Examples library.
Using more than two rims
If we have more than two rims, we need to interlace some of them so we still end up with two variables to use in tables #RIMACT and #RIMWGT. If, for example, we have rims for gender, age and area, we could interlace gender and age…
DS $SIDE=$GENDER.BY.$AGE,
… then tabulate $SIDE * $AREA.
If we have four rims, we could interlace the first three, and so on … but might eventually break some limits. The maximum number of rows in a table is 32000 and the maximum number of columns is 1500 so, if interlacing all except the last variable gives more than 32000 cells, we must also interlace some variables to be tabulated across the top, e.g.
DS $SIDE = $GENDER.BY.$AGE.BY.$CLASS.BY.$MARITAL.BY.$HHSIZE,
DS $TOP = $AREA.BY.$CARS,
The maximum number of rims is 100, and the maximum number of cells when interlacing them all is 1,228,800.
Further examples can be found in items 11.7, 11.8 and 11.9 of the Tips and Examples library.
Any questions? Email support@merlinco.co.uk. ls
Weighting is a vast subject, and MERLIN can handle anything from the simple application of existing factors to the calculation of complex multi-stage weighting.
In this blog we will start from scratch by defining the terms used in weighting and giving some simple examples, then expand on this in future articles.
Respondent and quantity weighting factors
A weighting factor can be a wvar or an ivar (a variable containing a number with or without a decimal point) and is used when incrementing tables, marginal counts, and frequency counts (but we will confine our discussion to tables). So, if a respondent has a weighting factor of 3, they will be counted three times instead of once when incrementing a table.
MERLIN distinguishes between respondent weighting (WR) and quantity weighting (WQ) and, to illustrate the difference, we will take the example of an ivar called $CARS containing the number of cars each respondent owns. If we use this statement…
SELECT WR $CARS,
… all tables following will be weighted by $CARS until we specify SELECT WR with another variable, or SELECT WR OFF to stop weighting. By default, MERLIN will show an unweighted total row (the number of respondents) and a weighted total row (the number of cars), and all other numbers and percentages in the table will relate to the weighted data. Whenever WR is used, MERLIN creates two internal tables, one for unweighted and one for weighted data, so although unweighted figures are usually shown only in the total row, they can be shown anywhere if appropriate formats are set. If, however, we specify…
SELECT WQ $CARS,
… only the unweighted total row will be shown, but it will contain the number of cars – in other words, the table is incremented entirely in terms of cars rather than respondents (so there is only one internal table). Some may not view this as weighting in the truest sense, but simply a way of counting the data. When doing this, it is sometimes necessary to count different parts of a table using different factors so, to save repeatedly specifying SELECT WQ, MERLIN allows you to add the relevant variable to the end of the table statement, e.g.
T#1 = $SIDE1 * $TOP + $FACTOR1, !apply quantity weight $FACTOR1 +T#1A = $SIDE2 * $TOP + $FACTOR2, !apply quantity weight $FACTOR2 +T#1B = $SIDE3 * $TOP + $FACTOR3, !apply quantity weight $FACTOR3 …
A factor specified in this way will temporarily replace any factor specified with SELECT WQ.
SELECT WR and SELECT WQ may both be applied to a table, and will generate an unweighted total representing the number of cars (in our example) and a weighted total representing the number of cars weighted by some additional factor.
The usual understanding of the term “weighting” is respondent weighting (WR) and, unlike quantity weighting, it is unusual for different factors to be required within a single table, or even within an entire run. Respondent weighting is usually done to correct imbalances in the sample, e.g. we have failed to interview enough females, so we give more weight to their answers
From now on, we will assume that we are using respondent weighting, and that it is being used for its usual purpose of correcting an imbalanced sample.
The weighting matrix
We first need to identify the groups to which different weighting factors will be applied, and will use the simple example of male and female. Together, these groups constitute a weighting matrix and each group is known as a cell. It is important that every respondent falls into one cell and one only, so they each receive one factor.
We don’t usually know the factors when we start the analysis, but should know the targets, i.e. the ideal number of respondents in each cell (also known as universe or population figures) – so we call this target weighting. Targets may be expressed in various ways such as percentages or population estimates, but it doesn’t really matter since they are essentially ratios, showing the target proportion in each cell. Let’s assume we have them as percentages, as shown in column (a) below, and the actual number of respondents is shown in column (b). The factor for each cell is the target divided by the actual sample, so that gives us the factors in column (c) and, if we apply them, we will arrive at the weighted sample in column (d).
Our main aim has now been achieved in that the cells are in the correct proportion but, as it stands, the total weighted base will be 100, which probably isn’t what we want – maybe we want a population estimate in thousands, or we want it to be the same as the unweighted. Either way, we can simply apply an overall “grossing-up factor” of the figure required divided by the current weighted total, e.g. 56100 / 100. In other words, we don’t need to convert all our target percentages into numbers.
The good news is that everything can be done in a single MERLIN run which increments the number in each cell, calculates the factor, then applies it to produce weighted tables. New users are sometimes surprised that MERLIN has no specific statements or functions for doing this – but that is because the MANIP stage already provides a powerful tool which enables us to treat a MERLIN table like a spreadsheet where we can reproduce the above table then apply the factors calculated. Item 11.2 of the MERLIN Tips and Examples library shows an insert file (PTARG.INC) which has been developed for this purpose, and can be used in any script where target weighting is required – such as example item 11.4.
Interlaced matrices and rims.
Let us now suppose the weighting relates to more than one variable so, as well as the 2 gender groups, we also have 4 age groups and 3 social class groups. How we proceed depends whether we have interlaced targets (i.e. 2 * 4 * 3 = 24 figures) or only the totals for each variable (i.e. 2 + 4 +3 = 9 figures).
In the first case, we can create a single variable which interlaces gender, age and class, so it is another example of target weightingdiscussed above. The interlaced variable is easily created with this MERLIN statement…
DS $MATRIX = $CLASS.BY.$AGE.BY.$GENDER,
… in which the first variable is the ‘outer loop’, i.e. the cells will be in this order…
$CLASS/1, $AGE/1, $GENDER/1, $CLASS/1, $AGE/1, $GENDER/2, $CLASS/1, $AGE/2, $GENDER/1, … and so on.
The second case described above is called rim weighting because we only know the rims (i.e. the totals), and we will discuss this in a future blog.
Applying calculated factors
Once we know the factors to be applied, we can use a ‘data lookup’ statement to specify the factor for each cell, maybe gross it up, then apply it, e.g.
DW $FACTOR = $GENDER (0.0996,1.1059), DW $FACTOR = $$ * 56100 / 100, !gross total up to 56100 SELECT WR $FACTOR,
The number of factors in brackets must equal the number of items in the matrix variable.
Since MERLIN runs so fast, users often allow it to re-calculate the factors in every run but, if you are doing many runs using the same weighting factors, it makes sense to replace the code that calculates them with the code above.
Any questions? Email support@merlinco.co.uk.
Market research data has traditionally contained very few character fields, and there have been even fewer requirements to process them. This has changed in recent years, however, and instead of gender being coded as 1 and 2, we are seeing data files that contain character strings ‘male’ and ‘female’. This is especially true of Excel data, which MERLIN is well able to process, and MERLIN also has powerful facilities for processing character data, some of which we will highlight in this blog.
Character variables
A MERLIN character variable (cvar) may be up to 9999 bytes in length. Each byte usually holds one character but UTF-8 data that cannot be converted into ANSI (e.g. Chinese) needs two or more bytes for each character. Note that UTF-8 files may be used freely in MERLIN and usually the user doesn’t even need to be aware of file formats but, in the single case of processing cvars derived from non-Latin data such as Chinese, RCP NOLATIN must be set, and the user needs to be aware of the additional storage space required. We will not discuss this further here, and all examples in this blog will assume the default setting RCP LATIN.
Let us suppose that gender is stored in our data file as a string of up to 6 characters. If we are using ASCII data, the definition might look like this…
DC $GENDER=$21-26, !pick up from columns 21-26
… or, if we are using Excel or CSV data, like this…
DC $GENDER(6)=$*5, !pick up from field #5
In the first case, the cvar length is implied by the length of the pickup field but must be explicitly stated in the second case, since the field could be any length.
The initial value of all cvars is Undefined (U) so, if a definition is bypassed by branching statements such as IF or FILTER, the cvar will retain this value – but if the definition statement is executed, the cvar will then contain exactly what is found in the data, e.g. ‘Male’, ‘FEMALE’, or maybe blank. If an Excel or CSV field is longer than the cvar’s length, a warning will be issued and additional characters ignored.
Listing, tabulating and exporting cvars
Cvars may be shown in report files using the LIST statement, or you can use a FREQ statement to generate a frequency count of all the different character strings found – particularly useful if you are unsure what the data file contains.
Cvars may be written out to new data files with the OUTPUT statement, or exported to other formats and packages via MERLIN Toolkit.
Cvars up to 1024 bytes in length may be tabulated on the vertical axis of a MERLIN table, showing all the character strings found in alphabetical order, followed by the number of blank or undefined records (if any).
Literals
Let us now suppose we wish to create a single coded variable (svar) called $GENDERX from our cvar $GENDER. We can do this with the following code:
DS $GENDERX=$GENDER/
‘male’; ‘Men’,
‘female’; ‘Women’,
‘male’ and ‘female’ are the strings we are looking for in the data, and are known as literals. ‘Men’ and ‘Women’ are the labels to be used in tables, which we have deliberately changed to illustrate the fact that they may differ from the literals.
It is not essential to create a cvar first – you can define the svar directly from the raw data location. In either case, if the literal is shorter than the entity being tested, MERLIN will assume there are trailing blanks at the end of the literal.
If you run a frequency count on the cvar or the raw data, MERLIN will generate code similar to the above in an FRQ file, which can then be incorporated into your script.
By default, tests on literals are not case sensitive, so the first category above would include ‘male’, ‘Male’, ‘MALE’, or even ‘mALe’. If you want the test to be case sensitive, set format ICL (Ignore Case in Literals) false.
If you know that your data contains variations of the literals, you can include these in the definition, e.g.
DS $GENDERX=$GENDER/
‘male’ + ‘M’; ‘Men’,
‘female’ + ‘F’; ‘Women’,
… or you could use ‘pattern matching’, described below.
Pattern matching
By default, MERLIN recognises two characters in literals as representing any character(s):
? means any single character (including a blank space)
* means any sequence of characters of any length
So, ‘A?C’ would pick up ‘AXC’, ‘A3C’, ‘A#C’, etc, and ‘A*C’ would pick up ‘AXC’, ‘AXxC’, ‘AX3xC’, etc. This means we could simplify our last example to…
DS $GENDERX=$GENDER/
‘M*’; ‘Men’,
‘F*’; ‘Women’,
If you need to test for ‘?’ or ‘*’ in a literal, you can precede them with the MERLIN escape character (the grave accent ‘`’), e.g. ‘Why`?’ means you are looking for the actual string ‘Why?’. Alternatively, you can disable pattern matching completely by setting format PML (Pattern Matching in Literals) false.
Cvar substrings
It is possible to isolate any part of a cvar (known as a ‘substring’) so, for example, $CVAR(2:3)/ ‘LM’ tests for characters 2-3 being ‘LM’.
Either or both of the substring boundaries may be integer constants, or ivars, e.g. $CVAR($IVAR1:$IVAR2) – which opens up powerful possibilities for processing free format text strings, e.g. a comma-separated address field within ASCII data.
If the substring reference starts on the first character, you may omit the value before the colon and if it ends on the last character, you may omit the value after the colon, e.g.
DC $PART1=$CVAR(:6),
DC $PART2=$CVAR(7:),
Functions
MERLIN has numerous functions that facilitate the processing of character data, including:
%CLEN | determines the length of a character string |
%INDEX | determines the position of one character string within another |
%LOWCASE | converts a character string into lower case |
%UPCASE | converts a character string into UPPER case |
%NUM | converts a character string into an ivar or wvar |
%STR | converts an ivar into a character string |
For more details about these and other functions, see the ‘Functions’ chapter of the MERLIN Help File.
Any questions? Email support@merlinco.co.uk. x
This blog looks at how to analyse additional data in MERLIN that isn’t in the main survey data file – a typical example being codes created from open-ended questions and entered in a secondary file together with serial numbers that match the main data file.
Merlinco software offers many different ways of dealing with this situation, and our object is to discuss the pros and cons of the main approaches so you can select the most suitable. We won’t be describing each method in detail, but will point you to the relevant features in the Help files.
IF statements – the worst method!
A common approach is to edit the secondary file to produce a series of IF statements which are then incorporated into the MERLIN script, specifying the action required for each serial number, e.g.
IF $IOBS/1, D $91-93=546,
IF $IOBS/2, D $91-93=701,
IF $IOBS/3, D $91-93=192,
…
This method is extremely inefficient and, if there are thousands of serial numbers involved, can create huge script files and cause limits to be exceeded. It is worth noting that this approach would be far more efficient (and run faster) if the code was changed as follows…
IF $IOBS/1, THEN, D $91-93=546,
ELSEIF $IOBS/2, THEN, D $91-93=701,
ELSEIF $IOBS/3, THEN, D $91-93=192,
…
ENDIF,
… but it would be even better not to use this approach at all!
Joining files with MERLIN Toolkit
On the Tools menu in MERLIN Toolkit, there is an option to ‘Join data files horizontally’, which enables you to match records in the main and secondary files, and append the latter to the former.
This approach is quick and easy to use, and means you can then forget the secondary file and simply define extra variables from data locations at the end of the record. It doesn’t matter whether or not secondary data exists for every record in the main file – and the program can also cope with any secondary data records not matching the main file – but there are two disadvantages:
a) You have no control over which secondary file data items are added, or where they go in the main file – the entire secondary file record will always be appended to the main data record. One consequence of this is that the serial number will be in two places in the record – which doesn’t matter in itself but, if the secondary records are very long and you only want some of the data, it is generally inefficient.
b) You may only use ASCII or CSV data files, and both files must be the same type – so you can’t append data from a CSV file into an ASCII one, and cannot use other file formats at all (although you could save Excel files in CSV format of course).
Using two data streams in MERLIN
MERLIN allows you to define two input streams to be read side-by-side – so you can use INPUT(A) for the main data and INPUT(B) for the secondary data. In effect, this does the same as MERLIN Toolkit Horizontal Join but within MERLIN analysis run – although you can, if you wish, output a new data file (containing whichever data you wish in any locations you wish) to be used in subsequent analysis runs. There are, however, several disadvantages:
a) As with Horizontal join, the two streams must both contain ‘freefield’ or ‘fixed location’ data files – although, unlike Horizontal Join, you can use Excel or card image files.
b) The files in both streams must be in serial number order (or whatever the matching key is).
c) Writing the MERLIN script to use two input streams is fairly simple if both streams contain the same serial numbers but, if there are any mismatches, it becomes more difficult – especially if either stream can contain serial numbers which are not in the other!
In most cases, MERLIN Toolkit Horizontal Join is preferable to this method.
Reading secondary data alongside MERLIN IDF
A variation of the above approach is to read the secondary data file alongside an IDF (MERLIN’s Internal Data File) previously created from the main data file. This is done by specifying both ‘IDF=’ and ‘INPUT=’ in the MERLIN Control Stage.
This saves re-reading the main data file, and allows you to use any type of secondary data file. Other than this, however, it has the same disadvantages as using two data streams – plus the fact that there is no way of sorting an IDF so, if it isn’t in serial number order, you can’t use this method.
The MERLIN %LOOKUP and %GETLKFD functions
This is the most flexible approach.
Up to 124 LOOKUP files may be defined in the MERLIN Control Stage, and they may be CSV, tab-delimited, or Excel files – irrespective of the format of the main data (which could be an IDF). Neither the main data file nor the LOOKUP file needs to be sorted in any particular order.
Using the %LOOKUP function you may read whichever fields you wish into a variable or a raw data location. For example, the following statement searches LOOKUP file (A) for the record where field 1 matches the serial number ($IOBS) in the main data file, then puts the contents of field 10 into the variable $IVAR.
DI $IVAR = %LOOKUP(A,$*10,$*1=$IOBS),
You may refer to LOOKUP file fields by number or letter or (if a header row is present), by name. You may use any of the six relational operators when testing for a match, e.g.
DC $CVAR = %LOOKUP(A,$*’area’,$*’ref2′.GE.$REF1),
Assuming you have your secondary data in a ‘freefield’ file (or can convert it into one), the only significant disadvantage of %LOOKUP is that it can be somewhat slow if very large data files are involved – but, if this is an issue, you can output a data file containing the secondary data to save using %LOOKUP in each run.
Finally, if the matching criteria are more complex, e.g. there may be more than one matching record and/or the match depends on more than one variable, the %GETLKFD function gives you even more flexibility, and allows you to write your own script for reading the LOOKUP file.
Any questions? Email support@merlinco.co.uk.
MERLIN has numerous options for ranking the rows of a distribution table, i.e. displaying them in descending or ascending order of the values they contain.
The minimum specification to achieve this is label control <G1B> (which means ‘Group level 1 Begins’) in the first item to be ranked. Usually this is the first item of the variable, causing all the rows to be ranked, but you may want some unranked rows at the start and/or the end. The latter effect is achieved by including <G1E> (‘Group level 1 Ends’) in the last item to be ranked. Here is an example:
X=’
Client brand; !unranked item at start
Rival brand 1<G1B>; !ranking group level 1 begins
Rival brand 2;
Rival brand 3;
Rival brand 4<G1E>; !ranking group level 1 ends
Other brands; !unranked items at end
Dk/na’,
Which column is to be used for ranking?
By default, rows are ranked using the values in the total column (column 0), but this can be changed with format RNC, e.g. RNC3 means the table will be ranked using the values in column 3 of the banner. If the table contains columns of statistics and you want to base the ranking on one of these, specify the relevant format in ’quotes’, e.g. RNC’MED’ will rank the table using the values in a column of medians.
RNC-1 is a special setting which means the rows will be sorted into alphabetical order of the row labels, regardless of the values.
Ascending or descending order?
By default, rows are ranked in descending order of the values, but format RVR specifies reverse ranking, meaning the rows will appear in ascending order.
In the case of alphabetical sorting, the order is always A-Z.
Numbers or column percentages?
Ranking may be based on the raw numbers, or on column percentages (regardless of which data types are shown). Usually, of course, this produces the same result (and, by default, raw numbers are used) – but if label control <B> is used to change the base for percentaging, you may choose to rank the rows according to the percentage values, by setting format RNV2. You would also, in this example, need to link the base row to the row it applies to, so they are always kept together – which brings us to the next topic…
Linking rows together
<G1N> means this Group level 1 item is linked to the Next item, so should always be moved with it, regardless of the value it contains. This is what we need for the example mentioned above:
X=’
Base for brand 1<B><G1N>; !this item is linked to the next one
Brand 1<G1B>; !ranking group level 1 begins
Base for brand 2<B><G1N>;
Brand 2;
Base for brand 3<B><G1N>;
Brand 3<G1E>; !ranking group level 1 ends
Base for other brands<B>; !unranked items at end
Other brands’,
<G1P> applies the same principle to the Previous item, as in this example where we are ranking three header groups according to the ‘yes’ response in each group:
X=’
Brand 1<g1n>\Yes<G1B>; !ranking group level 1 begins
No<G1P>; !this item is linked to the previous one
Dk<G1P>; !this item is linked to the previous one
Brand 2<g1n>\Yes;
No<G1P>;
Dk<G1P>;
Brand 3<g1n>\Yes<G1E>; !ranking group level 1 ends
No<G1P>;
Dk<G1P>’,
Other types of ranking
You may define multi-level (nested) ranking by using different group level numbers (up to 9 levels are allowed). This can become very complex and we won’t cover it here, but see example 8.1 in the Tips and Examples library. The most common examples of multi-level ranking can be done using the NET command in a variable definition, which handles netting and ranking in a single operation.
You may rank statistics that are in consecutive continuation tables using format RCG. See example 9.5 in the Tips and Examples library.
Disabling ranking
Format RNK is true by default but if set false, will disable all ranking, whether done with label control <G> or NET. This provides an easy way to switch ranking on and off without changing your variable definitions.
Any questions? Email support@merlinco.co.uk.
MERLIN allows you to generate statistical data in rows or columns of a table from single or multi coded variables (svars or mvars) – or in rows only from numeric variables (ivars or wvars). Two different methods are available, which we shall refer to as automatic and manual generation methods. Here is a list of all the statistics that may be generated, and the formats which control them.
BST | Base for statistics |
SUM | Sum of scored values |
SSQ | Sum of squares |
AVG | Average |
SDV | Standard deviation |
SER | Standard error |
EVR | Error variance |
ILL | Lowest score value |
ILH | Highest score value |
MDE | Mode |
MED | Median |
ILEn | Quantiles (automatic generation only) |
When using numeric variables, the last five statistics above may only be produced in frequency tables (not statistics tables).
Automatic generation method (rows only)
Rows of statistics may be generated automatically from svars, mvars, ivars or wvars. In the case of svars and mvars, statistics will only be generated if score values are specified in the labels with label control <Vn>, e.g.
DS $Q3=$20/
1;’Very good<V2>’,
2;’Good<V1>’,
3;’Neither<V0>’,
4;’Bad<V-1>’,
5;’Very bad<V-2>’,
6;’Dk/na’,
If any such variable is used on the vertical axis of a table, one or more rows of statistics will automatically appear at the end of the table, depending on the format(s) set – so, if F=AVG/SDV/EVR, three rows of statistics will appear. By default, they will appear in the order of the above list, but this can be changed with the SELECT STAT ORDER statement.
Note that format ILE needs a numeric setting n between 2 (= median) and 10 (= deciles). This format will produce n-1 rows showing the quantile points, i.e. 1 row for a median up to 9 rows for deciles.
To globally change the row label associated with a statistic, you can re-define the relevant special text, e.g. %AVG – but to change it for a specific variable, it is easier to include a dummy item at the end of the variable, using label control <AA> to assign the label to AVG, or <A1> to assign it to the first statistic shown, e.g. for an svar or mvar:
D;’Product X average<AA>’,
… or, for an ivar or wvar with no other labels:
V=D;’Item A base for stats<A1>’,
Manual generation method (rows or columns)
Rows or columns of statistics may be generated from svars or mvars (providing the labels contain score values as above), by specifying the statistics as dummy items within the variable. This is done using label control <T=xxx> where xxx is the relevant format setting (ILE is not allowed).
Although it requires more typing, this method has several advantages:
Here is an example:
DS $Q3=$20/
1;’Very good<V2>’,
2;’Good<V1>’,
3;’Neither<V0>’,
4;’Bad<V-1>’,
5;’Very bad<V-2>’,
D;’Mean score<T=AVG>’,
D;’Std.deviation<T=SDV>’,
D;’Err.variance<T=EVR>’,
6;’Dk/na’,
Any questions? Email support@merlinco.co.uk.
MERLIN provides two ways of tabulating a numeric variable on the vertical axis of a table – known as statistics and frequency tables. Users are sometimes unclear about the difference between the two, so we will illustrate it using a variable called $NUMBER. This statement…
T#1=$NUMBER * $CLASS,
…produces a statistics table like this (the statistics shown depend on FORMAT settings)…
Often this type of table is joined onto the end of another table (maybe showing ranges of values) – or several statistics tables may be joined together to produce a statistical summary from different variables.
This statement…
T#2=$NUMBER(15) * $CLASS,
…produces a frequency table showing the frequency distribution of the values, as well as the statistics…
The syntax ‘(15)’ tells MERLIN to create a table with 15 rows (plus total, undefined, and the statistics rows) – in other words, we are allowing for up to 15 different values being found in the variable $NUMBER. This is an estimate, and our table shows that only 12 different values were found, so MERLIN automatically suppressed the unused rows.
You may be tempted to specify a high number like ‘(9999)’ but this is very inefficient since MERLIN will unnecessarily create a huge table – which is likely to slow the run down and (if there are many such tables), could cause limits to be exceeded. Since there are only 20 respondents in the sample, there cannot possibly be more than 20 different values, so you should specify 20 or less. If the number you specify is too low, MERLIN will report this and abort the run so you can increase it.
Format TNP
If your script contains many frequency tables, all containing the syntax ‘(100)’ for example, it is more efficient to use format TNP. If you set this to 100, it means that all table specifications from that point onwards which have a numeric variable on the vertical axis but no number in brackets, will generate frequency tables with up to 100 different values. If you want to use a different number (higher or lower), you can specify this in the usual way and it will take precedence over the TNP setting. If you want to produce a statistics table instead of a frequency table, you can specify ‘(0)’. For example…
SELECT BANNER $CLASS,
F=TNP100, | !allocate 100 rows from now on |
T#1=$NUM1 *, | !frequency table with 100 rows |
T#2=$SVAR *, | !not a numeric variable so TNP ignored |
T#3=$NUM2 *, | !frequency table with 100 rows |
T#4=$NUM3(200) *, | !frequency table with 200 rows |
T#5=$NUM4(0) *, | !statistics table |
T#6=$NUM5 *, | !frequency table with 100 rows |
Statistics requiring frequency distribution
The following statistics can only be produced in frequency tables, since the calculation requires a full frequency distribution:
ILE | quantiles |
ILH | highest value |
ILL | lowest value |
MDE | mode |
MED | median |
If you do not wish to show the frequency distribution, you can suppress it by setting format DIS false, e.g.
T#7(F=MED/NDIS)=$NUM6(150)*, !frequency table with median but no distribution rows
Character tables
Finally, although not strictly part of our topic, we will mention that the tabulation of character data (e.g. postcodes) works in a similar way to numeric data. You can either show summary rows only (answered, not answered, undefined) – or you can show the full distribution by specifying ‘(n)’ after the character variable. Format TNP may be used as for numeric data.
Any questions? Email support@merlinco.co.uk.
MERLIN can perform many statistical tests, but the one we get asked about most is how to flag significant differences between cells in the same row with letters – something like this…
The letter highlighted in yellow denotes that the percentage above it is significantly greater than the percentage in column B of the same row at the 99% level – and the letter highlighted in green denotes that the mean score above it is significantly greater than the mean score in column G of the same row at the 95% level.
In the table above, we have set formats TTS3/SHG1, but this article describes many variations allowed.
Which values are to be tested?
This is determined by format TTS, which can be set as follows:
TTS0 = no tests
TTS1 = test mean scores only (using a t-test)
TTS2 = test column percentages only (using a z-test)
TTS3 = test both
Which columns are to be tested against which?
This is determined by format SHG:
SHG-2 = test each column against the remainder (i.e. total column minus current column)
SHG-1 = test each column against the total column
SHG0 = test all columns against each other (the default setting)
SHG1 = test within each 1st level header group, i.e. REGION, AGE, GENDER
SHG2 = test within each 2nd level header group (known in MERLIN as ‘overheaders’)
SHG3 = test within each 3rd level header group (known in MERLIN as ‘superheaders’)
Note that SHG-1 is a special case that should not be used unless the data has been changed with MANIP.
In the table above, we have used SHG1 – as indicated by the footnote which is generated automatically (although its appearance and position can be varied with special text %SHG and formats CGI, CGS and PSF).
Clients sometimes wish to test columns within header groups (SHG1) and against total minus each column (SHG-2) on the same table – and this can be achieved using formats TTS3/SHG1/SGX3. SGX does the same test as TTS3/SHG-2, but flags cells with one or two asterisks (depending on the significance level) instead of letters, and making it easy to distinguish between the two tests. By default, the asterisks are shown in the same cell as the values but format FBC allows you to move them below, and format SMS allows you to show + or – instead of *. Special text %SGX allows you to add text to the footnote.
Finally, you can specify your own pattern of testing with the SELECT SHG statement, e.g. SELECT SHG (1.4) (2.5) (3.6), means test column 1 against 4, 2 against 5, and 3 against 6.
Which levels of testing are to be used?
MERLIN allows testing at two significance levels, determined by formats SLA and SLB. By default these are set to 95 and 99 respectively but can be given any values between 60 – 99.99999, with up to 5 decimal places. If SLA and SLB have the same value, only one level of testing will be done.
Which flag characters are to be used?
By default, the columns on each page will be lettered from A-Z (lower case for SLA level and UPPER case for SLB level). This means that you cannot test more than 26 columns, which can be a problem when outputting tables to Excel with no ‘page breaks’ in the banners – but there are two ways around this. First, format TTL may be set to:
TTL0 = assign letters across all columns (the default setting)
TTL1 = re-start lettering on each test group, as determined by format SHG. So, if F=SHG1, the lettering will re-start on the 1st column within each header group
TTL2 = same as TTL1, but omitting the first column in each test group (typically used when this contains a sub-total). So, if F=SHG2, the lettering will re-start on the 2nd column within each overheader group.
Second, you may re-define and/or extend the string of characters used, in special texts %FCA and %FCB, e.g.
%FCA = ‘abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ’,
%FCB = ‘ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞŸ’,
If you are only testing at one level, you could redefine %FCA to include both upper and lower case letters, allowing a large number of columns to be tested against each other.
Finally, you can assign flag characters manually using label control <T=SMAx/SMBy>, and this method allows you to exclude columns from testing by not assigning any characters to them.
The table below was produced using formats TTS3/SHG1/SGX3/FBC/SLB95/TTL1.
How are low bases managed?
If the unweighted base for a column is lower than the value specified by format TMA, it will be marked with a flag character and, if it is lower than the value specified by format TMB, it will be flagged and excluded from significance testing. The flag characters are controlled by formats SBA and SBB.
By default, columns will be flagged with * and excluded from testing if the base is lower than 30.
What else?
Here are some other significance tests that can be done in MERLIN, using the formats shown:
CHI chi-square test
CHS single sample chi-square test
DEP dependent t-test
KST Kolmogorov-Smirnov test
LSD least significant difference
MWW Mann-Whitney-Wilcoxon test
TTF f-test
Any questions? Email support@merlinco.co.uk.
When computers were first used to analyse market research data the most common means of storing data was column binary (card image) format but, with the advent of telephone and web interviewing packages, ASCII data became the favourite. Now, however, more and more data are stored in Excel or CSV files – and there are numerous advantages to this approach including:
MERLIN can read and write data in Excel or CSV files, and the principles described below apply to both. We refer to data fields / cells in such files as freefields (free format fields) to distinguish them from fixed location fields in ASCII and column binary files, which we have traditionally called fields.
Assuming that the first row in each data file is a header containing the names of the variables (which is nearly always the case) you will need to specify one of these RCPs (Run Control Parameters):
The RECORD LENGTH statement needs to specify the number of freefields in your data file(s), rather than the number of characters, e.g. RECORD LENGTH 57 FIELDS. There could, in fact, be more freefields, but this would mean you are only analysing the first 57.
To refer to a freefield, you can use one of three approaches (which can be mixed freely within the same script). In all cases, the reference starts with $* (as opposed to $ on its own, used to refer to ASCII and column binary data locations):
The last approach is the most flexible, since Q25 could be in any location, and its position may vary from file to file (which is often the case when you are analysing different waves of the same survey together).
Another advantage of Excel or CSV files is that multi-coded variables can be stored in a single freefield, as opposed to the traditional approach of entering code 0 or 1 in a series of locations (known as a ‘bitstring’). The bitstring approach may still be used with freefield data but the former is easier to work with, and more flexible if the number of valid codes changes. A single freefield could therefore contain “1,7,10” to signify these 3 values (the quotation marks are optional in Excel, but compulsory in a CSV file). To indicate that a freefield may contain more than one value, an extra asterisk must be added after the freefield reference, e.g.
If your data file is not in Excel or CSV format, you can use MERLIN Toolkit to convert it, e.g. you can import from SPSS, and you can automatically generate a MERLIN script defining the variables.
Any questions? Email support@merlinco.co.uk.