Administering a smart collection

Creating a collection

The most difficult task in all of SMART is collection creation. Unfortunately, it is also the first task encountered, since the rest of the system won't do you much good without a collection! To create a collection, the major subtasks are to tell SMART
  1. where to find the documents, what format the documents are in, and how to convert them into a standard format.
  2. what the indexable information in the document is, how it should be indexed, and how the indexed documents should be stored.
  3. what format the queries will be in, and how they should be indexed.
  4. what retrieval method should be used to compare queries against documents.
  5. what should be displayed in what format to either an interactive user or an experimenter after retrieval is done.
This information is conveyed to SMART by a collection specification file, so creating this file is the primary job of collection creation. A spec file is composed of lines of the form
        parameter_name parameter_value
where parameter_value can take on many types, including the name of a procedure. Here's the spec file for the adi collection (from smart/src/test_adi/indexed.good/spec, line numbers added here).
1  ## INFORMATION LOCATIONS
2  database              /home/smart/smart.11.0/src/test_adi/indexed
3  include_file          /home/smart/smart.11.0/lib/spec.expcoll
4  doc_loc               /home/smart/smart.11.0/src/test_adi/indexed/doc_loc
5  query_loc             /home/smart/smart.11.0/src/test_adi/indexed/query_loc
6  qrels_text_file       /home/smart/smart.11.0/src/test_adi/coll/qrels.text
7  
8  ## ADI DOCDESC
9  #### GENERIC PREPARSER
10 num_pp_sections                 6
11 pp_section.0.string             ".I"
12 pp_section.0.action             discard
13 pp_section.0.oneline_flag       true
14 pp_section.0.newdoc_flag        true
15 pp_section.1.string             ".A"
16 pp_section.1.section_name       a
17 pp_section.2.string             ".B"
18 pp_section.2.section_name       b
19 pp_section.3.string             ".W"
20 pp_section.3.section_name       w
21 pp_section.4.string             ".T"
22 pp_section.4.section_name       t
23 pp_section.5.string             ".O"
24 pp_section.5.action             discard
25 
26 #### DESCRIPTION OF PARSE INPUT
27 index.num_sections              4
28 index.section.0.name            a
29 index.section.1.name            b
30 index.section.2.name            w
31   index.section.2.method        index.parse_sect.full
32   index.section.2.word.ctype    0
33   index.section.2.proper.ctype  0
34 index.section.3.name            t
35   index.section.3.method        index.parse_sect.full
36   index.section.3.word.ctype    0
37   index.section.3.proper.ctype  0
38 title_section 3
39 
40 #### DESCRIPTION OF FINAL VECTORS
41 num_ctypes                      1
42 
43 ## ALTERATIONS OF STANDARD PARAMETERS
44 dict_file_size                  3563
45 
46 ## ALTERATIONS OF STANDARD PROCEDURES
Line 2 gives the pathname of the database directory. All non-located (ie, not beginning with a '.' or '/') filenames are assumed to be relative to this directory. (Note: this is almost guaranteed to pop up unexpectedly and catch an experimenter eventually!)
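
The relative-filename rule for line 2 can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not SMART's own code; the helper name resolve_spec_path is hypothetical.

```python
import posixpath

def resolve_spec_path(database, filename):
    """Resolve a spec filename as the text describes (hypothetical helper).

    Names beginning with '.' or '/' count as already located and are
    returned unchanged; anything else is taken relative to `database`.
    """
    if filename.startswith((".", "/")):
        return filename
    return posixpath.join(database, filename)

database = "/home/smart/smart.11.0/src/test_adi/indexed"
print(resolve_spec_path(database, "doc_loc"))    # relative to the database
print(resolve_spec_path(database, "./doc_loc"))  # left alone: current directory
```

The second call shows the trap mentioned in the note: a name such as "./doc_loc" is not moved under the database directory.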

Line 3 gives the location of another specification file that contains all the standard defaults for (in this case) an experimental test collection. Most non-experimental collections will include the file .../smart/lib/spec.default instead.

Lines 4-6 give the location of information needed to create the collection. In particular, line 4 tells where to find a file that contains the location of the document texts to be indexed. Line 5 does the same for the canned query set (this line would not be included for non-experimental collections). Line 6 tells where to find the relevance judgments (again, only for an experimental collection).

Lines 8-24 describe the text document format (covered in more detail below). They tell how to convert the original document/query into a standard format in which the document is broken up into sections.

Lines 26-38 describe the standard format (also covered in more detail below). They tell what parsing action should be done on each section of the document/query.

Line 38 says that the beginning of section 3 should be used as the title to display to the user for an interactive query.

Line 41 tells how many types of information are in each indexed document vector.

Line 44 says the initial size of the dictionary is 3563. This is discussed below.

There is a lot of information needed by smart not given here; it is hidden in the default spec files included. A collection spec file contains only the information specific to that collection.
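
To make the "parameter_name parameter_value" layout concrete, here is a minimal Python sketch that reads such lines into a dictionary. It is not SMART's spec reader. Treating lines that start with '#' as comments and stripping surrounding double quotes are assumptions inferred from the sample above, and include_file is not followed.

```python
def parse_spec(text):
    """Parse 'parameter_name parameter_value' lines into a dict (sketch)."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' lines carry no parameter
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]  # assumption: quotes only delimit the value
        params[name] = value
    return params

sample = """\
## ADI DOCDESC
num_pp_sections                 6
pp_section.0.string             ".I"
pp_section.0.action             discard
title_section 3
"""
spec = parse_spec(sample)
print(spec["pp_section.0.string"])  # .I
```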

Subtask 1: Conversion of documents to standard format (preparsing).

In general this task has to be done by a purpose-written procedure or program, since SMART cannot know an arbitrary document format in advance. However, if the documents can easily be broken into sections based on keywords occurring at the beginning of lines, then the standard "generic" preparser can be used. This is the case for a number of common collection types like mail messages, news messages, standard information retrieval test collection format, and pure text. smart/Sample contains a number of examples of this.

Lines 9-24 in the adi collection above give one of those examples.

Lines 15 and 16 say that if the string ".A" is encountered at the beginning of a line, it indicates the start of section 'a'. By default, all text following the ".A" will be copied as part of section 'a' until the next section is found. Similarly, lines 17,18 indicate that text following ".B" is part of section 'b', and so on for pairs 19,20 and 21,22. Lines 23,24 say that if the string ".O" is encountered, all text following should be discarded. (Note that since the text is discarded, it's not considered part of any section.) Finally, going back to lines 11-14, if a line beginning with a ".I" is encountered, text following it on the same line should be discarded. However, the oneline_flag set in line 13 indicates that this section only lasts for the current line (a default action of discarding text will take place if the next section does not begin on the next line). Also, if a ".I" is found, then a new document is started.
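
The behaviour just described can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the generic preparser itself: the Section class, the preparse function and the sample text are all hypothetical, and text on a marker line after the marker is copied only for non-discarded sections.

```python
from dataclasses import dataclass

@dataclass
class Section:
    marker: str             # pp_section.N.string
    name: str = ""          # pp_section.N.section_name
    discard: bool = False   # pp_section.N.action is "discard"
    oneline: bool = False   # pp_section.N.oneline_flag
    newdoc: bool = False    # pp_section.N.newdoc_flag

ADI_SECTIONS = [
    Section(".I", discard=True, oneline=True, newdoc=True),
    Section(".A", "a"), Section(".B", "b"),
    Section(".W", "w"), Section(".T", "t"),
    Section(".O", discard=True),
]

def preparse(text, sections):
    """Return one {section_name: [lines]} dict per document."""
    docs = []
    current = None  # section currently being copied; None means discard
    for line in text.splitlines():
        hit = next((s for s in sections if line.startswith(s.marker)), None)
        if hit is None:
            if current is not None and docs:
                docs[-1].setdefault(current.name, []).append(line)
            continue
        if hit.newdoc:
            docs.append({})
        rest = line[len(hit.marker):].strip()
        keep = not hit.discard
        if keep and rest and docs:
            docs[-1].setdefault(hit.name, []).append(rest)
        # A oneline or discard section leaves nothing to copy afterwards.
        current = hit if keep and not hit.oneline else None
    return docs

sample = """.I 1
.T
sample title one
.A
A. N. AUTHOR
.W
first line of text
second line of text
.O
text that is thrown away
.I 2
.T
sample title two
"""
docs = preparse(sample, ADI_SECTIONS)
print(docs[0]["w"])
```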

If your documents happen to fit into a category handled by the generic preparser, you're all set. Otherwise, you'll have to either write your own preparsing procedure within SMART to recognize your document format, or write a program to convert your documents to a standard format before they're even presented to SMART. The generic preparser accepts, as a parameter value, a filter program to be run before any preparsing action is done (examples of filter programs might be "uncompress" or "deroff"). If you need to write your own preparsing program within SMART, see "Doc/howto/modify" for how to incorporate your new procedure.

The preparsers in general are assumed to get a list of documents to be indexed from the file given by specification parameter doc_loc. If doc_loc is equal to "-", then the list of files is read from standard input. Thus the invocation for smart at indexing time is often of the form

  cd coll_dir; find $cwd -type f -print |\
     smart index.doc database/spec
See the samples in smart/Sample for more approaches.
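
If a doc_loc file is preferred over piping from find, the same list can be produced with a short Python sketch. The function names are hypothetical, and writing one pathname per line is an assumption (it matches what find prints), not a documented doc_loc format.

```python
import os
import tempfile

def list_document_files(root):
    """Absolute paths of regular files under root, like `find root -type f`."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.isfile(path) and not os.path.islink(path):
                paths.append(os.path.abspath(path))
    return sorted(paths)

def write_doc_loc(paths, doc_loc_path):
    """Write one pathname per line (assumed layout of the list file)."""
    with open(doc_loc_path, "w") as out:
        for path in paths:
            out.write(path + "\n")

# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as coll_dir:
    os.makedirs(os.path.join(coll_dir, "sub"))
    for name in ("b.txt", os.path.join("sub", "a.txt")):
        with open(os.path.join(coll_dir, name), "w") as f:
            f.write("text\n")
    found = list_document_files(coll_dir)
    print(len(found))  # 2
```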

Subtask 2: Finding and handling indexable information.

Once the document has been put into a standard format, it can be handled by the standard procedures in the rest of SMART. You need to describe which of the standard procedures should be invoked for which type of information found in the document. For example, if a name has been isolated, you may wish to normalize it to a standard format, and you do not want to run stemming on it. All of this is done within the parse description portion of the collection specification file. For each parsable section of the document given by the preparser, you can specify
  1. the parser to be used for this section
  2. For each parsetype of information (eg word, proper noun, number)
    1. what ctype (type of information) should be assigned
    2. what stemming procedure should be called
    3. what stopword procedure should be called
For each ctype you can specify (at least)
  1. What procedure maps a token to a concept number
  2. What procedure maps a concept number back to a token
  3. What procedure is used to weight this type of information
  4. What procedure is used to store this type of information
  5. What procedure is used to perform inverted retrieval
  6. What procedure is used to perform sequential retrieval
Luckily, all of these things have reasonable default values, otherwise the document description task would be very tedious.

The document description from the adi collection says that there is only one type of final indexed information (line 41), that there are 4 types of potentially indexable sections recognized by the preparser (line 27, names on lines 28,29,30,34), that no parsing is done on sections 0 and 1, that both sections 2 and 3 should use parsing method "index.parse_sect.full", and that only words and proper nouns from those sections should finally be indexed as ctype 0 (eg, numbers are ignored). By default, all tokens are rejected if they are on a stopword list, and if not, are then stemmed. Also, by default the documents are stored in both vector and inverted forms for experimental collections, but just in inverted file form for non-experimental collections.
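
The default order of operations (reject stopwords first, then stem what is left, then count under a ctype) can be illustrated with a toy Python sketch. Nothing here is SMART code: the stopword list is a hypothetical four-word set, toy_stem is a stand-in that only strips a final "s", and numbers are dropped simply because they are not alphabetic.

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "and"}  # hypothetical tiny stopword list

def toy_stem(word):
    """Stand-in stemmer: strip one trailing 's' (not SMART's stemmer)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def index_section(text, ctype=0):
    """Count stemmed, non-stopword tokens as (ctype, term) pairs."""
    counts = Counter()
    for token in text.lower().split():
        if not token.isalpha():
            continue  # numbers are ignored, as in the adi description
        if token in STOPWORDS:
            continue  # stopword check happens before stemming
        counts[(ctype, toy_stem(token))] += 1
    return dict(counts)

vec = index_section("the retrieval of 3 documents and the documents index")
print(vec)
```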

You can run

        smprint proc index
to get a full list of all the indexing procedures available. To get the parameters each procedure actually uses, run
        docsmart 

Subtask 3. Query format

All of the same possibilities as above apply to queries as well as documents. The procedures to be called and their related parameters can be specified separately for queries. The parts of the document description giving the types of information must remain the same. In practice, the only departures from document indexing are in storage format (vector only) and possibly in choice of preparser.

An additional job when handling queries is the task of getting an interactive query from a user. Depending on options available for an interface, that can be done by specifying a query_skel, a text file that contains a query skeleton that the user can modify. The query skeleton is often just a stripped down document that the user can edit.

Subtask 4. Retrieval methods

For normal interactive use on a simple collection, retrieval doesn't have to be specified at all. An inverted-file inner-product match is automatically computed between the query and each document, finding the best documents to show the user. Retrieval is a major part of experimental information retrieval, so there are already a number of retrieval options available, and the number should keep on increasing. But unless you are doing something unusual with your non-experimental collection, there's nothing to do.
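
For readers unfamiliar with the term, an inverted-file inner-product match can be sketched as below. This is a generic textbook-style illustration in Python, not SMART's retrieval code; the data layout, function names and weights are all made up for the example.

```python
from collections import defaultdict

def build_inverted(docs):
    """docs: {doc_id: {concept: weight}} -> {concept: [(doc_id, weight)]}."""
    inv = defaultdict(list)
    for did, vec in docs.items():
        for concept, weight in vec.items():
            inv[concept].append((did, weight))
    return inv

def retrieve(inv, query, top=10):
    """Score each document by the sum of query weight * document weight."""
    scores = defaultdict(float)
    for concept, qweight in query.items():
        # Only the lists for concepts in the query are ever touched.
        for did, dweight in inv.get(concept, ()):
            scores[did] += qweight * dweight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

docs = {
    1: {"index": 2.0, "retrieval": 1.0},
    2: {"retrieval": 3.0},
    3: {"dictionary": 1.0},
}
inv = build_inverted(docs)
ranked = retrieve(inv, {"retrieval": 1.0, "index": 0.5})
print(ranked)  # [(2, 3.0), (1, 2.0)]
```

Document 3 shares no concept with the query, so it is never scored, which is the point of going through the inverted file.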

Subtask 5. Display

The display of documents to the user has many similarities to the preparsing of the original documents. You have the freedom to write your own document printing routine and offer that as an option within SMART (eg, if you're working with TeX, your preparser probably called "detex", and your output routine may call "latex"). You can also just display the documents in whatever raw form they're stored in the filesystem. The most popular option generally is to display a formatted version of the preparsed document (the output of the collection dependent preparser). The formatting provisions are reasonably primitive, but are often sufficient. For example, the original text of pobj_text.c which includes the procedure description
 *0 print document texts
 *1 print.obj.doctext
 *2 print_obj_text (in_file, out_file, inst)
 *3   char *in_file;
 *3   char *out_file;
 *3   int inst;
 *4 init_print_obj_doctext (spec, unused)
 *5   "print.doc.indivtext"
 *5   "print.doc.textloc_file"
 *5   "print.doc.textloc_file.rmode"
 *5   "print.trace"
 *4 close_print_obj_text (inst)
 *6 global_start,global_end used to indicate what range of docs will be printed
 *7 The textloc relation "in_file" (if not VALID_FILE, then use textloc_file),
 *7 will be used to print all doc texts in that file (modulo global_start,
 *7 global_end).  Text output to go into file "out_file" (if not VALID_FILE,
 *7 then stdout).
 *8 Procedure indivtext gives format of doc text output.
when using docsmart will appear as
  DESCRIPTION:
  print document texts
  
  PROCEDURE:
  print_obj_text (in_file, out_file, inst)
    char *in_file;
    char *out_file;
    int inst;
  init_print_obj_doctext (spec, unused)
  close_print_obj_text (inst)
  
  HIERARCHY:
  print.obj.doctext
  
  USES:
    "print.doc.indivtext"
    "print.doc.textloc_file"
    "print.doc.textloc_file.rmode"
    "print.trace"
  global_start,global_end used to indicate what range of docs will be printed
  
  FULL DESCRIPTION:
  The textloc relation "in_file" (if not VALID_FILE, then use textloc_file),
  will be used to print all doc texts in that file (modulo global_start,
  global_end).  Text output to go into file "out_file" (if not VALID_FILE,
  then stdout).
  
  ALGORITHM:
  Procedure indivtext gives format of doc text output.
  
  BUGS AND WARNINGS:
which, if not any more understandable, is at least prettier! The actual specification pair giving that is
  print.format "DESCRIPTION:\n%d\nPROCEDURE:\n%m%p%r\n
HIERARCHY:\n%h\nUSES:\n%s%g\nFULL DESCRIPTION:\n%f\n
ALGORITHM:\n%a\nBUGS AND WARNINGS:\n%b"
where the %x construct says to include the preparsed document section with name 'x' in the output at this point. (Only the first letter of section names is significant).
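
The %x substitution can be sketched in Python as follows. This is an illustration only: expand_format is a hypothetical name, the section names in the example are invented, and two behaviours are assumptions rather than documented facts (a literal \n in the format is turned into a newline, and a %x with no matching section expands to nothing).

```python
import re

def expand_format(fmt, sections):
    """Replace each %x with the section whose name starts with x."""
    # Only the first letter of a section name is significant.
    by_letter = {name[0]: text for name, text in sections.items() if name}
    fmt = fmt.replace("\\n", "\n")  # assumption: \n in the spec means newline
    return re.sub(r"%(.)", lambda m: by_letter.get(m.group(1), ""), fmt)

sections = {
    "description": "print document texts",
    "hierarchy": "print.obj.doctext",
}
fmt = r"DESCRIPTION:\n%d\nHIERARCHY:\n%h\nBUGS AND WARNINGS:\n%b"
print(expand_format(fmt, sections))
```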

SMART filemodes

Another task that is normally done in the collection specification file is to determine the access modes for reading, writing and creating all of the database files. For example, you can bring a relation entirely into memory to work on it, or work on it from disk and bring one entry into memory at a time, or (if MMAP is defined in src/h/param.h) map the file into memory, and let the operating system's paging algorithm bring the appropriate parts of the file into memory as needed. All of this is only an efficiency issue, but it can make the difference between the system being usable and being intolerable. Every database file accessed can have its own modes given in the spec file, but normally collection-wide defaults are given that are based on size characteristics of the collection. If the collection is several times larger than memory on the machine running smart, then normally disk based accesses are used. If the collection is small, then memory based is faster. If you have a large collection and have MMAP available, it should be used.

Disk based access is given by specification values
  rmode                   SRDONLY
  rwmode                  SRDWR
  rwcmode                 SRDWR|SCREATE
Memory based access is given by
  rmode                   SRDONLY|SINCORE
  rwmode                  SRDWR|SINCORE
  rwcmode                 SRDWR|SCREATE|SINCORE
Mmap access is given by
  rmode                   SRDONLY|SMMAP
  rwmode                  SRDWR
  rwcmode                 SRDWR|SCREATE
(Mmap access is currently only implemented for smart read operations.)
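
The rule of thumb for picking collection-wide defaults can be written down as a small Python sketch. The mode strings are the ones listed above; the function name and the single cut-off (a collection counts as "small" when it is no larger than memory) are assumptions made for the illustration, since the text gives no exact threshold.

```python
def choose_modes(collection_bytes, memory_bytes, have_mmap):
    """Pick default rmode/rwmode/rwcmode values (hypothetical helper)."""
    if collection_bytes <= memory_bytes:
        # Small collection: memory based access is faster.
        return {
            "rmode": "SRDONLY|SINCORE",
            "rwmode": "SRDWR|SINCORE",
            "rwcmode": "SRDWR|SCREATE|SINCORE",
        }
    # Larger collection: mmap for reads if available, otherwise plain disk.
    return {
        "rmode": "SRDONLY|SMMAP" if have_mmap else "SRDONLY",
        "rwmode": "SRDWR",
        "rwcmode": "SRDWR|SCREATE",
    }

for name, value in choose_modes(10**9, 64 * 10**6, have_mmap=True).items():
    print(name, value)
```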
If there is a specific file that you want to access in a different fashion than these defaults, that could be done
   query.textloc_file.rwmode  SRDWR|SINCORE
for example. Another parameter set that may be given in the spec file is advice about memory and swap space available for the program. Particular procedures (notably vec_inv) that want to use a lot of virtual memory can refer to these values as a guide. The defaults for vec_inv are set
  vec_inv.mem_usage       4194000
  vec_inv.virt_mem_usage  50000000
basically saying it should try to limit its own resident set size to about 4 Mbytes, and should allocate no more than 50 Mbytes of memory altogether. (For a large collection, this means vec_inv would have to write out intermediate lists to disk).

A final collection dependent parameter that could be set in the spec file is the size of the basic dictionary hash file. Again, this setting is only a question of efficiency, but an important question. Too small, and dictionary accesses become slow as overflow procedures have to be used. Too large, and space is wasted both in the dictionary and in the inverted file. Ideal is to end up with a dictionary with a bit more than half of its entries filled. You can tell the state of the dictionary by running

  smprint rel_header dict_file_name
The default setting is reasonable for standard information retrieval test collections (~30000), but is too small for larger collections. Depending on exactly what is being indexed (eg, mail sources and destinations in an electronic news collection), values up to 500,000 (or larger) may be reasonable. Eventually, the basic implementation of dictionaries will change to something more reasonable. In the meantime, this particular hash-based approach works and is fast but is inflexible.
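
As a rough aid for choosing dict_file_size, the "a bit more than half filled" guideline can be turned into arithmetic. The sketch below is only that: the function name is hypothetical and the 55 percent target is an assumed reading of "a bit more than half", not a value taken from SMART.

```python
def suggest_dict_file_size(expected_terms, target_fill_percent=55):
    """Smallest size for which expected_terms fill at most the target share."""
    if expected_terms <= 0 or not 0 < target_fill_percent <= 100:
        raise ValueError("need a positive term count and a percentage in 1..100")
    # Integer ceiling division avoids floating point rounding surprises.
    return (expected_terms * 100 + target_fill_percent - 1) // target_fill_percent

print(suggest_dict_file_size(16500))  # 30000
```

With the assumed 55 percent target, about 16,500 distinct terms would fill a dictionary of 30,000 entries to the suggested level.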

Actually creating a collection

To create a collection database, the normal procedure is
  1. Create the database directory
  2. Put your collection specification file there.
  3. Put other secondary files there (eg query_skel).
  4. Get a list of document files to be indexed (normally put in file doc_loc)
  5. Run "smart index.doc /spec"
Creating a test collection is basically the same process, except the query set and relevance assessments have to be put in the collection. So replace steps 4 and 5 with
  4. List of document files put in file doc_loc. List of query files put in file query_loc. File with relevance judgments put in file qrels.text. (These filenames are actually all given by specification parameters and can be named whatever you like as long as the specification values reflect those names.)
  5. Run "smart exp_coll /spec"

Other administrative tasks

SMART object copying.

Presently the SMART system does no reclaiming of "dead" space in inverted file or vector objects. This is a deliberate design decision, although it will change someday. It allows collection update to occur at the same time as users are working with the system, without any effort being spent on exclusion/synchronization problems. The drawback is that the size of a dynamic collection's files will grow without bound as documents are added or deleted.

The short-term solution is simply to copy these files from time to time, using the smart access procedures. The command

  smart convert /spec proc convert.obj.inv_inv in f1 out f2
will copy inverted file object f1 to f2. Similarly,
  smart convert /spec proc convert.obj.vec_vec in f1 out f2
  smart convert /spec proc convert.obj.textloc_textloc in f1 out f2
will copy a document vector file and a textloc file. The copied files then have to be moved back manually. All of this can be done automatically within SMART; I just haven't gotten around to writing the half page routine that will do it. In the mid to long term, more attention has to be paid to collection maintenance commands.

Adding documents to an existing collection

Normally, everything is all set for doing this. If new_doc_list is a file containing the filenames of all documents to be added, then
  smart index.doc /spec  doc_loc new_doc_list
will add those documents.

Deleting documents from an existing collection

Basically this still needs a lot of work. A simple variation of the conversion procedure vec_inv.c needs to be written to efficiently delete vectors from the appropriate inverted list. In the meantime, what is sufficient for our needs is to delete a whole list of documents at a time, and that can be done by copying the affected files while supplying the list of docs not to be copied. Eg, if the file $temp contains document ids to be deleted, the following 3 commands will delete them from an inverted file, a document vector file, and a textloc file (ie the file that maps doc ids to text location).
  # Copy inverted file, removing deleted docs
  smart convert spec proc convert.obj.inv_inv  \
                   in $database/inv.nnc out $database/inv.new \
                   deleted_doc_file $temp
  # Copy doc file, removing deleted docs (not needed for news)
  smart convert spec proc convert.obj.vec_vec  \
                  in $database/doc.nnn out $database/doc.new \
                  deleted_doc_file $temp
  # Copy textloc file, removing deleted docs
  smart convert spec proc convert.obj.textloc_textloc  \
                   in $database/textloc out $database/textloc.new \
                   deleted_doc_file $temp
  # Warning, the following operation may interfere with users already
  # running a retrieval (everything above should not).
  /bin/mv $database/textloc.new $database/textloc
  /bin/mv $database/textloc.new.var $database/textloc.var
  /bin/mv $database/doc.new $database/doc.nnn
  /bin/mv $database/doc.new.var $database/doc.nnn.var
  /bin/mv $database/inv.new $database/inv.nnc
  /bin/mv $database/inv.new.var $database/inv.nnc.var
Note that copying the files like this has the side benefit of compacting the files. Also note the warning above. There is currently no code within SMART to prevent somebody from running a retrieval at the same time the copied files are being moved around. That could be a problem.

Actually constructing the file $temp is done in a collection dependent fashion. If you want to remove documents from SMART whose text form has been removed, the following command will work

  # find which documents no longer have text files
  smart print $database/spec proc print.obj.did_nonvalid out $temp
Otherwise, you're pretty much on your own. See smart/Sample/update_news for a full updating script.
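
The idea behind the did_nonvalid step (list the ids of documents whose text file no longer exists) can be sketched in Python. This does not read SMART's textloc relation: the dictionary from document id to pathname is a hypothetical stand-in for it, and the function name is made up.

```python
import os
import tempfile

def nonvalid_doc_ids(textloc):
    """Ids whose text file is missing. textloc: {doc_id: path} (assumed shape)."""
    return sorted(did for did, path in textloc.items()
                  if not os.path.isfile(path))

# Demonstration: document 1 still has its text, documents 2 and 3 do not.
with tempfile.TemporaryDirectory() as d:
    kept = os.path.join(d, "kept.txt")
    with open(kept, "w") as f:
        f.write("still here\n")
    gone = os.path.join(d, "gone.txt")  # never created
    deleted = nonvalid_doc_ids({1: kept, 2: gone, 3: gone})
print(deleted)  # [2, 3]
```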