Written & maintained by Hans Paijmans
paai@kub.nl
Work in progress! Not finished! Use 'reload' often!
Last changes: first week of August 1996.

The goal of this part of the course is to give the student enough
information to conduct experiments with the SMART information
retrieval system.
During this session we will demonstrate the use of the system by
annotated screendumps rather than by allowing direct contact with the
system. There are a number of reasons for this.
Please note that this tutorial is by no means complete and may contain various errors!
Does Smart run on a PC?

The actions of Smart are controlled by the
so-called 'spec-files'. These files consist of parameter-value pairs
(refer to Example 1 and
observe how every line consists of such a pair).
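To make the idea of parameter-value pairs concrete, here is a minimal Python sketch that reads such lines into a dictionary. It is not part of Smart. The parameter names and values in SAMPLE are hypothetical, and treating lines that start with '#' as comments is an assumption made for this illustration.

```python
def parse_spec(text):
    """Read 'parameter value' lines into a dict (illustration only)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' carry no pair.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        params[name] = value
    return params


# Hypothetical entries, not copied from a real spec-file.
SAMPLE = """\
# location of the database and number of ctypes
database /home/user/smart/db
num_ctypes 2
"""

spec = parse_spec(SAMPLE)
print(spec)
```

A later pair with the same name overwrites an earlier one in this sketch, which loosely mirrors how values in one spec-file can supersede those in an included file.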
Before a database can be indexed and used, we need to prepare at least two help-files, to wit a file that describes the location of the datafile(s) to be indexed and one or more spec-files that describe the data and the actions of Smart. The default values that are valid for most databases can be found in the file 'spec.default'.
Often these values are superseded by those in a file with experimental values 'spec.expcoll'. Also, if you decide to use a list of stopwords, this list has to be prepared.
Parameters that describe the data and actions of the individual database are collected in the file 'spec.data' (in the original Smart documentation this file is called 'spec'; for clarity I prefer 'spec.data'). As you inspect these files you will see that 'spec.data' includes one of the files 'spec.expcoll' or 'spec.default' and that 'spec.expcoll' includes 'spec.default'. More information on the spec-files may be found in the original Smart documentation files app.spec and defaults.
A relatively clear description of all the fields of the 'spec-file' can be found in 'admin'. You might compare the spec-file that is described in admin to the file 'spec.data' that is used for our examples.
The texts that are to be used with SMART should be in ASCII. Smart has a pre-processor that among other functions recognizes separate records and fields.
Make sure that you are still looking at spec: example 1.
First observe the three lines that define where the database and the spec.default may be found and what file contains the list with documents to be indexed. Don't be disturbed by the direction of the 'slashes' in the directory names: that is common Unix-usage.
The documents in our example are structured in such
a way that every new field starts with two capitals and a colon; a new
record starts with the string "Rec:". Therefore we must tell the Smart
preprocessor which fields it must recognize and what action it should
take for each of them.
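The record layout just described can be illustrated with a short Python sketch. This is not the Smart preprocessor, only a stand-in that follows the same two rules: a line starting with "Rec:" opens a new record and is itself discarded, and a line starting with two capitals and a colon opens a new field. The sample text is made up for the illustration.

```python
import re

# Two capital letters followed by a colon mark the start of a field.
FIELD = re.compile(r"^([A-Z]{2}):\s*(.*)$")


def split_records(text):
    """Split text into records (dicts mapping field tag to field text)."""
    records = []
    current = None
    field = None
    for line in text.splitlines():
        if line.startswith("Rec:"):
            # Start a new record; the marker line itself is not kept.
            current = {}
            records.append(current)
            field = None
            continue
        if current is None:
            continue  # ignore anything before the first record
        match = FIELD.match(line)
        if match:
            field = match.group(1)
            current[field] = match.group(2)
        elif field is not None and line.strip():
            # Continuation line: append to the field that is open.
            current[field] = (current[field] + " " + line.strip()).strip()
    return records


# Made-up sample data.
SAMPLE = """\
Rec: 1
TI: Information retrieval
AU: Smith, J.
Rec: 2
TI: Vector space
 models
"""

records = split_records(SAMPLE)
print(records)
```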
Reload spec: example 1.
Then we tell the preparser to mark every line that starts with Rec: as the beginning of a new document, but not to include it in the indexing process. Because the default action for every section was set to 'copy', we explicitly have to include the action 'discard' for this section. In the following lines this action is not repeated; TI: just indicates that the text that follows is the title and that it should be copied to the Smart main processor. In this manner all parts of the document (fields) may be marked for the real parser. By the way: note that only the first character of the section_name (here 'title') has meaning.
The second example shows how every field is prepared for
the indexing proper. For the moment we will skip over the exact
meaning of the attributes, but we will meet the ctype again
later in the tutorial.
After you have prepared a 'spec.data'-file like this one for your database, the time has come to consider the indexing proper. Please remember that a detailed example of a spec-file may be found in the original documentation, in the file 'admin'.
Your data directory should now look like this.
Now for a question that should show whether you understood which files are needed before you can start the indexing process.
How many text files are needed at a minimum to index a database?
Make sure that you are looking at the data directory as it should
look before the indexing process starts.
You can inspect the contents of the files by clicking on the names (however, the 'exisummer.txt' will not be loaded completely). Note how the file 'doc_loc' contains a list of the documents to be indexed: in our example that is only one file. (Do not forget to use the righthand mouse button to switch back and forth in the Examples-frame).
In case you are wondering about the function of the file 'common_words': Smart can both do stemming (i.e. truncate words before they are added to the index) and use a list of unwanted words. In 'spec.default' the latter option was set, therefore we will need a file with the name 'common_words'.
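The two options mentioned here can be sketched in a few lines of Python. This is only an illustration of the principle: the stopword set is made up, and cutting every word to a fixed length is a crude stand-in for stemming, not the algorithm Smart uses.

```python
def index_terms(text, stopwords, stem_len=None):
    """Return the terms that would be indexed (illustration only).

    stopwords: words to leave out of the index.
    stem_len:  if given, truncate every remaining word to this length.
    """
    terms = []
    for word in text.lower().split():
        word = word.strip(".,;:!?()\"'")  # drop surrounding punctuation
        if not word or word in stopwords:
            continue
        if stem_len is not None:
            word = word[:stem_len]
        terms.append(word)
    return terms


# Made-up stopword list; a real 'common_words' file is much longer.
STOPWORDS = {"the", "of", "a", "and"}

plain = index_terms("The introduction of the system.", STOPWORDS)
stemmed = index_terms("The introduction of the system.", STOPWORDS, stem_len=8)
print(plain, stemmed)
```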
Every action of Smart is initiated by calling the program with two
or more parameters. In many situations these parameters will be the
'spec-file' of the database and the action required. To index a
collection the first parameter would be index.doc and the
second spec.data.
Let's try it. We give from the unix command line (#) the command:
# smart index.doc spec.data < doc_loc
and after that, as you will presently see in the frame below, there will be some new files in the directory. This is all that is required to create a database and now you can use the interactive mode of smart to query the database.
Errors: Errors are most likely caused by entries in one of the spec-files that point to non-existing files. The file 'common_words' in 'spec.default' is my favorite, but 'spec.default' itself may cause problems if it is referred to in 'spec.expcoll'. So check and double-check all spec-files for pathnames!
Whereas all the files needed to start the indexing process (see Example 3) were plain text files, this is not the case for the completed indices. You need Smart to read and display the contents of those files.
It may be interesting to know that there is a small shellscript smprint that calls Smart to print the information in these files. The following command will print the contents of the file 'dict' (concept to token mapping).
# smprint -s spec.data dict dict
and
# smprint -s spec.data inv inv.nnn
will print the inverted file (stored in 'inv.nnn' and 'inv.nnn.var').
What is the most common error when administering a Smart database?
Was 'stemming' enabled during our examples?
(refer to the example
where the 'dict'-file was printed above)

After all these tribulations you can finally start Smart as an interactive process with the command:
# smart inter spec.data
Now you can do a simple keyword search by typing the command 'run', followed by keywords, e.g. 'information retrieval'. When you are finished, type a dot on a new line and a list of selected documents will appear:
Smart (ntq?): run information retrieval .
Do not forget the dot!
The first column displays the document number, the second column
the similarity between the query and the document and the third column
the title of the document (if any).
Pressing return at this point will display the first (next) complete
document, or you can enter a document number.
Please note how some fields in the example (TITLE, AUTHOR) are formatted and some (TN:, TY:) are not. This is controlled by the last line of our spec-file, but such a format-string can also be entered on the command-line and then overrides the entry in the spec-file. The syntax of the format string is C-like:
print.format " TITLE: \n%t\n AUTHOR:\n%a\n ABSTRACT:\n%d\n CATEGORY:\n%r\n KEYWORDS:\n%k\n ANNOTATION:\n%n"
If you do not speak 'C', you should be aware of the fact that the \n stands for "new line" and that %t, %a, %d and so on are the identifiers of the different fields. You can toggle the formatting of documents by the command 'Raw_doc'.
Of course, I could have been more careful in suppressing the printing of unwanted fields such as TN: or TY:, but it serves its turn as a demonstration of formatted and unformatted fields.
How do you display any record from the database?
How do you cause the fieldnames TITLE, AUTHOR to be written on the same line as the first line of the data?

So far we only had the 'inv.nnn' and 'inv.nnn.var' files that stored the keywords together with the term frequency. This is about what the 'normal' IR systems of today do and if this was all, we would not need to bother with the quirks of Smart documentation.
However, an important property of any IR or database system is the ability to restrict keyword matches to selected fields, the so-called 'field control'. This needs some adaptations in the 'spec.data'-file.
Fields in Smart are defined, as we have seen, in the 'spec.data'-file. Perhaps you will have observed the term ctype in those spec-files. You can define a number of those ctypes and one of their properties is that they can be indexed in different inv-files.
See the following example from the 'spec.data'-file. The number of
ctypes is set to 2 and, more important, two lines are added with the
information that the second ctype should be indexed in 'dict.tit' and
'inv.tit' (of course you are not upset by the fact that the second
ctype has the number 'one'; programmers like to count from
zero). After indexing the database you can use field control by
starting the query with the name of the field as it appears in the
database, in our example TI:.
Smart (ntq?): run TI: information retrieval .
Possibly the most important feature of Smart is
that it offers a multitude of ways to weigh the relative importance of
terms in a document. The table below gives a survey of all the
possibilities. We will concern ourselves here with the atc variant
(try to identify the meaning of this term from the table). To
build indices with atc-weights we first need to index the database in
such a way that we also have the document vectors lined up.
To do this we include the 'spec.expcoll' in the
second line of the 'spec.data' file. Then we have to erase the existing
indices (dict*, textloc* and inv*) from the directory and index the
database again. After inspection of the directory we see two new
files: 'doc.nnn' and 'doc.nnn.var'. We can display the contents again
with smprint:
# smprint -s spec.data vec doc.nnn
Now the going gets tough! Smart has to use the information from both doc.nnn and inv.nnn to create the atc-weighted files. First we run
# smart convert spec.data proc convert.obj.weight_doc in doc.nnn out doc.atc doc_weight atc
which gives us the document vectors with atc-weights (doc.atc and doc.atc.var) and
# smart convert spec.data proc convert.obj.vec_aux in doc.atc out inv.atc
which gives us the inverted file with atc-weights (inv.atc and inv.atc.var).
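For readers who want to see what such a conversion amounts to, here is a small Python sketch. It assumes the common reading of the three letters: 'a' for augmented term frequency, 't' for inverse document frequency and 'c' for cosine normalization. The formulas are the textbook forms and may differ in detail (for example in the base of the logarithm) from what Smart actually computes; the term counts are made up.

```python
import math


def atc_weights(docs):
    """Turn raw term counts ('nnn') into atc-style weights (illustration).

    docs: list of dicts mapping term -> raw frequency in that document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1

    weighted = []
    for doc in docs:
        if not doc:
            weighted.append({})
            continue
        max_tf = max(doc.values())
        raw = {}
        for term, tf in doc.items():
            augmented = 0.5 + 0.5 * tf / max_tf   # 'a'
            idf = math.log(n_docs / df[term])     # 't'
            raw[term] = augmented * idf
        norm = math.sqrt(sum(w * w for w in raw.values()))  # 'c'
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted


# Made-up term counts for three tiny documents.
DOCS = [
    {"smart": 2, "index": 1},
    {"smart": 1, "query": 3},
    {"vector": 1},
]

weights = atc_weights(DOCS)
for vector in weights:
    print(vector)
```

Because of the cosine normalization every document vector has length one, so all weights fall between zero and one.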
The directory of the complete database can now be viewed in the frame below.
Do you need to create 'nnn' and 'atc' indices to enable field control?
To query Smart with other indices than the default nnn-indices with word frequencies, you have to create an alternative spec-file. We called it spec.data.atc and as you will see, it just includes the file 'spec.data' and adds the *.atc indices. We now start Smart with the alternative spec-file:
# smart inter spec.data.atc
and get the normal screen. But if we again enter the query
Smart (ntq?): run information retrieval .
we get a different list (compare with the nnn query-result and switch back and forth a
few times). Note that in the atc-variant the similarities are
expressed as weights between zero and one, whereas the nnn-variant
gives weights that are integers.
If you had access to the
database, would you expect a difference in length between the records
returned for the atc-query and those returned for the nnn-query?
If at the Smart command-line we give the command
Smart (ntq?): adv
this causes the appearance of a screen with the so-called 'advanced' commands. Identify the command Dvec and try
Smart (ntq?): Dvec 12
You will see the document vector of document 12. Disregard the column ctype for the moment and observe columns Con and Weight. You will remember that the 'Con' is the identifying number of a keyword (concept) and 'Weight' gives the atc-weight of that keyword in that document.
If you want to see all the atc-weights of that particular word in your database (take the first word, 'introduc', with concept number 1139), you can type
Smart (ntq?): Inv 1139
The columns indicate from left to right the document, the ctype, the conceptnumber and the weight.
In case you are wondering how Smart computes similarities between
records, or between records and queries, you can again start Smart in
'nnn'-mode (with the command smart inter spec.data) and ask the system
to compute the similarity between two records, e.g. 12 and 13, with
Dsim and then the number of matching words with
Dmatch.
In the nnn-table the results of both operations are
shown. Of course the computing of weights and similarities is what
Smart is all about and you could do worse than read some of the
authoritative texts on the subject, such as Salton & McGill's
Introduction to Modern Information Retrieval (McGraw-Hill
1983).
Now quit Smart, start it again in 'atc-mode'
and give the same commands again. The result is in the atc-table.
You will observe that with both commands the similarity of the two records is the sum of the fifth column: in fact Smart computed the inner product of the two document vectors.
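The inner product mentioned here is easy to reproduce by hand. The Python sketch below stores each document vector as a mapping from concept number to weight, multiplies the weights of the concepts the two documents share and adds up the products. The concept numbers and weights are made up; they are not the real vectors of records 12 and 13.

```python
def matching_terms(vec_a, vec_b):
    """List (concept, product of weights) for concepts in both vectors."""
    shared = sorted(set(vec_a) & set(vec_b))
    return [(concept, vec_a[concept] * vec_b[concept]) for concept in shared]


def inner_product(vec_a, vec_b):
    """Similarity as the sum of the per-concept products."""
    return sum(product for _, product in matching_terms(vec_a, vec_b))


# Made-up vectors: concept number -> weight (here raw frequencies).
DOC_A = {1139: 2, 87: 1, 530: 3}
DOC_B = {1139: 1, 530: 2, 999: 4}

matches = matching_terms(DOC_A, DOC_B)
similarity = inner_product(DOC_A, DOC_B)
print(matches, similarity)
```

Concepts that occur in only one of the two vectors contribute nothing, which is why only the matching words need to be listed.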
Finally print a list of documents, ctypes, conceptnumbers and keywords with:
# smprint -s spec.data vec_dict doc.atc

SMART is very much a Unix program in that it accepts input from stdin and writes to stdout. This makes it possible to use the program for complicated or specialized needs in a way that would have been impossible in an MS-Windows environment. Indeed all the examples that originally come with Smart are written in the Unix C shell script language, which is an added hurdle for non-Unix adepts.
If you try to compare Unix with a Microsoft product such as MS-DOS, MS-Windows or Win95, you do not so much compare the performance of the two systems (Unix will win that hands down) but actually two philosophies for problem solving.
Essentially, Microsoft products try to shield the user from the intricacies of the system at the cost of performance, clarity and flexibility, whereas Unix makes available tools that are optimized for performance and flexibility and leaves it to the user how he wants to apply these tools.
This is best demonstrated by considering how you perform a simple task in both operating systems, e.g. sorting a list of names and addresses, selecting all people from Tilburg and printing the last twenty addresses. In a Microsoft environment you would have to load a complete database management package of perhaps twenty or thirty megabytes, including graphic tools, spelling checkers and advanced statistics, and click around with the mouse on popups and dropdowns. In Unix you would apply three small tools to construct a so-called pipe:
sort < list | grep Tilburg | tail -20
and continue with your work.
Because of the great number of parameters that
can be passed to Smart, it makes sense to write scripts. An easy
script is the script make_index that takes care of erasing old
indices and then does a complete re-index of the database. This comes
in handy when you have a working set of spec-files, but want to tweak
them for different ctypes, re-indexing the database in
between to see the results.
Another obvious example is smprint, but it may look a bit complicated to the novice. Nevertheless it is relatively easy to use, as we have seen above.
Using the normal Unix tools, the output of smprint can be written in any format:
smprint -s spec.data vec_dict doc.atc | cut -f1,4,5 | sort -k 3
In the same way queries can be piped into
Smart. Any program that can create ASCII files can collect queries
from users, start Smart and display the results. For instance, the
following Unix-script 'query' can be used together with redirection in the Unix command
smart inter spec.data < query > result
to automatically run the query and print the results to a file named 'result'.
The application of HTML-forms and writing of cgi-scripts lie beyond the scope of this tutorial. Nevertheless we will present examples of the steps to be taken.
The first step is the construction of the form (don't send this
one, it is disabled and will cause an error message). The source of
this form can be browsed using the facilities of Netscape. The form
points to a cgi-script
that activates one of the three query scripts, depending on the kind
of display the user has selected. The query scripts (use query_titles as an
example) first write a temporary file with the Smart commands,
including the keywords, and then call Smart with this temporary file
as input.
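The two steps of such a query script (write the Smart commands to a temporary file, then call Smart with that file as input) can be sketched in Python as follows. This is not one of the original scripts. The function names are hypothetical, the layout of the command text ('run', the keywords, a dot on a line of its own) follows the interactive session shown earlier, and the call assumes that a 'smart' binary and a 'spec.data' file are available.

```python
import subprocess
import tempfile


def build_query(keywords):
    """Return the text a user would type at the Smart prompt."""
    return "run\n" + " ".join(keywords) + "\n.\n"


def run_query(keywords, spec_file="spec.data", smart_binary="smart"):
    """Write the commands to a temporary file and feed it to Smart.

    Assumes 'smart inter <spec_file>' reads its commands from stdin,
    as in the redirection example above.
    """
    with tempfile.NamedTemporaryFile("w+", suffix=".query") as tmp:
        tmp.write(build_query(keywords))
        tmp.flush()
        tmp.seek(0)  # let Smart read the file from the beginning
        completed = subprocess.run(
            [smart_binary, "inter", spec_file],
            stdin=tmp,
            capture_output=True,
            text=True,
            check=True,
        )
    return completed.stdout


query_text = build_query(["information", "retrieval"])
print(query_text)
# On a machine with Smart installed one could then call:
#   print(run_query(["information", "retrieval"]))
```

A cgi-script would take the keywords from the submitted form instead of from a fixed list; keywords coming from users should be checked before they are passed on.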
You will realize that the examples given here are rough-and-ready, aimed at exposing the principles rather than at the creation of a smooth WWW-page. Nevertheless it is hoped that they will point the way to driving an information service using Smart.

For anyone interested in the analysis of text and documents in the Unix environment it should be useful to realize that there exist quite a number of programs that can be used together with Smart or that analyse texts in a different way. I collected a spate of them (that run under Linux) as Paai's Text Utilities.

SMART itself is available as C-source at ftp://ftp.cs.cornell.edu/pub/smart/smart.11.0.tar.z . It was written for Unix on a Sun station, but after some minor tweaking it compiles and runs beautifully on a common Intel-PC (if you use the freeware Unix-clone Linux instead of the ubiquitous MS-DOS or MS-Windows). The Linux binaries can be downloaded from ftp:pi0959.kub.nl/pub/smart/smart_linux.tgz, but please refer to the original sources for legal limitations.
There exists a mailing list, 'smart-people@CS.Cornell.EDU', for problems, but don't expect to be helped with newbie questions. For an HTML-version of the extant documentation see my WWW-site (the use of Smart). Some links from this page will also refer to these original files.