parameter_name parameter_value
where parameter_value can take on many types including procedure.
Here's the spec file for the adi collection (from
smart/src/test_adi/indexed.good/spec, line numbers added here).
 1  ## INFORMATION LOCATIONS
 2  database                /home/smart/smart.11.0/src/test_adi/indexed
 3  include_file            /home/smart/smart.11.0/lib/spec.expcoll
 4  doc_loc                 /home/smart/smart.11.0/src/test_adi/indexed/doc_loc
 5  query_loc               /home/smart/smart.11.0/src/test_adi/indexed/query_loc
 6  qrels_text_file         /home/smart/smart.11.0/src/test_adi/coll/qrels.text
 7
 8  ## ADI DOCDESC
 9  #### GENERIC PREPARSER
10  num_pp_sections                 6
11  pp_section.0.string             ".I"
12  pp_section.0.action             discard
13  pp_section.0.oneline_flag       true
14  pp_section.0.newdoc_flag        true
15  pp_section.1.string             ".A"
16  pp_section.1.section_name       a
17  pp_section.2.string             ".B"
18  pp_section.2.section_name       b
19  pp_section.3.string             ".W"
20  pp_section.3.section_name       w
21  pp_section.4.string             ".T"
22  pp_section.4.section_name       t
23  pp_section.5.string             ".O"
24  pp_section.5.action             discard
25
26  #### DESCRIPTION OF PARSE INPUT
27  index.num_sections              4
28  index.section.0.name            a
29  index.section.1.name            b
30  index.section.2.name            w
31  index.section.2.method          index.parse_sect.full
32  index.section.2.word.ctype      0
33  index.section.2.proper.ctype    0
34  index.section.3.name            t
35  index.section.3.method          index.parse_sect.full
36  index.section.3.word.ctype      0
37  index.section.3.proper.ctype    0
38  title_section                   3
39
40  #### DESCRIPTION OF FINAL VECTORS
41  num_ctypes                      1
42
43  ## ALTERATIONS OF STANDARD PARAMETERS
44  dict_file_size                  3563
45
46  ## ALTERATIONS OF STANDARD PROCEDURES
Line 2 gives the pathname of the database directory. All non-located (ie, not beginning with a '.' or '/') filenames are assumed to be relative to this directory. (Note: this is almost guaranteed to pop up unexpectedly and catch an experimenter eventually!)
Line 3 gives the location of another specification file that contains all the standard defaults for (in this case) an experimental test collection. Most non-experimental collections will include the file .../smart/lib/spec.default instead.
Lines 4-6 give the location of information needed to create the collection. In particular, line 4 tells where to find a file that contains the location of the document texts to be indexed. Line 5 does the same for the canned query set (this line would not be included for non-experimental collections). Line 6 tells where to find the relevance judgements (again, only for an experimental collection).
Lines 8-24 describe the text document format (described in more detail below). This tells how to convert the original document/query into a standard format that has the document broken up into sections.
Lines 26-38 describe the standard format (also described in more detail below). This tells what parsing action should be done on each section of the document/query.
Line 38 says that the beginning of section 3 should be used as the title to display to the user for an interactive query.
Line 41 tells how many types of information are in each indexed document vector.
Line 44 says the initial size of the dictionary is 3563. This is discussed below.
There is a lot of information needed by smart not given here; it is hidden in the default spec files included. A collection spec file contains only the information specific to that collection.
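As a rough illustration of the "parameter_name parameter_value" layout and the relative-filename rule described for line 2, here is a minimal Python sketch. It is not SMART's own spec reader. It assumes that lines beginning with '#' are comments, that surrounding double quotes are dropped from values, and that a later setting overrides an earlier one; it does not follow include_file. The function names parse_spec and resolve_path are hypothetical.

```python
import posixpath


def parse_spec(text):
    """Collect 'parameter_name parameter_value' pairs into a dict.

    Assumptions (not taken from SMART itself): '#' starts a comment line,
    surrounding double quotes are stripped, later settings win.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        params[name] = value
    return params


def resolve_path(database, filename):
    """Apply the rule that non-located filenames are relative to 'database'."""
    if filename.startswith((".", "/")):
        return filename
    return posixpath.join(database, filename)
```

With the adi spec above, resolve_path(spec["database"], "doc_loc") would give a path inside the database directory, while a name starting with '.' or '/' is left untouched.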
Lines 9-24 in the adi collection above give one of those examples.
Lines 15 and 16 say that if the string ".A" is encountered at the beginning of a line, it indicates the start of section 'a'. By default, all text following the ".A" will be copied as part of section 'a' until the next section is found. Similarly, lines 17,18 indicate that text following ".B" is part of section 'b', and so on for pairs 19,20 and 21,22. Lines 23,24 say that if the string ".O" is encountered, all text following should be discarded. (Note that since the text is discarded, it's not considered part of any section). Finally, going back to lines 11-14, if a line beginning with a ".I" is encountered, text following it on the same line should be discarded. However, the oneline_flag set in line 13 indicates that this section only lasts for the current line (a default action of discarding text will take place if the next section does not begin on the next line). Also, if a ".I" is found, then a new document is started.
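The behaviour just described can be sketched in a few lines of Python. This is only a model of the rules as stated above, not the generic preparser itself: SectionRule, ADI_RULES and preparse are hypothetical names, and details such as trimming whitespace after a marker are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SectionRule:
    marker: str                 # string that must start the line
    name: Optional[str] = None  # section name, or None for "action discard"
    oneline: bool = False       # section lasts only for the marker line
    newdoc: bool = False        # marker starts a new document


# The six pp_section entries from the adi spec file.
ADI_RULES = [
    SectionRule(".I", None, oneline=True, newdoc=True),
    SectionRule(".A", "a"),
    SectionRule(".B", "b"),
    SectionRule(".W", "w"),
    SectionRule(".T", "t"),
    SectionRule(".O", None),
]


def preparse(text, rules=ADI_RULES):
    """Split raw text into documents, each a list of (section, text) pairs."""
    docs = []
    current = None   # sections of the document being built
    active = False   # True while text is being copied into a section
    for line in text.splitlines():
        rule = next((r for r in rules if line.startswith(r.marker)), None)
        if rule is None:
            if active:
                current[-1][1].append(line)
            continue
        if rule.newdoc or current is None:
            current = []
            docs.append(current)
        rest = line[len(rule.marker):].strip()
        if rule.name is None:
            active = False          # discard until the next marker
        else:
            current.append((rule.name, [rest] if rest else []))
            active = not rule.oneline
    return [[(n, "\n".join(ls)) for n, ls in doc] for doc in docs]
```

For input such as ".I 1", ".T", "a title", ".O", "junk", the ".I" line opens a new document and is dropped, "a title" lands in section 't', and "junk" is discarded.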
If your documents happen to fit into a category handled by the generic preparser, you're all set. Otherwise, you'll have to either write your own preparsing procedure within SMART to recognize your document format, or write a program to convert your documents to a standard format before they're even presented to SMART. The generic preparser accepts, as a parameter value, a filter program to be run before any preparsing action is done (examples of filter programs might be "uncompress" or "deroff"). If you need to write your own preparsing program within SMART, see "Doc/howto/modify" for how to incorporate your new procedure.
The preparsers in general are assumed to get a list of documents to be indexed from the file given by specification parameter doc_loc. If doc_loc is equal to "-", then the list of files is read from standard input. Thus the invocation for smart at indexing time is often of the form
cd coll_dir; find $cwd -type f -print |\
smart index.doc database/spec
See the samples in smart/Sample for more approaches.
The document description from the adi collection says there is only one type of final indexed information (line 41), 4 types of potentially indexable sections recognized by the preparser (line 27, names on lines 28,29,30,34), that no parsing is done on sections 0,1, that both sections 2 and 3 should use parsing method "index.parse_sect.full", and that only words and proper nouns from those sections should finally be indexed as ctype 0 (eg, numbers are ignored). By default, all tokens are rejected if they are on a stopword list, and if not, are then stemmed. Also, by default the documents are stored in both vector and inverted forms for experimental collections, but just in inverted file form for non-experimental collections.
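To make the default token handling concrete (numbers ignored, stopwords rejected, remaining tokens stemmed), here is a small Python sketch. It only mirrors the order of those steps. The tokenizing pattern, the lower-casing, and especially toy_stem are assumptions for illustration; toy_stem is a placeholder and not the stemmer SMART uses.

```python
import re


def toy_stem(word):
    """Placeholder stemmer (assumption): strip one trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def index_terms(text, stopwords):
    """Return the terms that would survive the steps described above."""
    terms = []
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        if token.isdigit():
            continue            # numbers are not indexed
        word = token.lower()    # assumption: case is folded
        if word in stopwords:
            continue            # rejected by the stopword list
        terms.append(toy_stem(word))
    return terms
```

For example, index_terms("The 3 systems index documents", {"the"}) keeps only the three content words, each reduced by the placeholder stemmer.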
You can run
smprint proc index
to get a full list of all the indexing procedures available.
To get the parameters each procedure actually uses, run
docsmart
An additional job when handling queries is the task of getting an interactive query from a user. Depending on options available for an interface, that can be done by specifying a query_skel, a text file that contains a query skeleton that the user can modify. The query skeleton is often just a stripped down document that the user can edit.
*0 print document texts
*1 print.obj.doctext
*2 print_obj_text (in_file, out_file, inst)
*3 char *in_file;
*3 char *out_file;
*3 int inst;
*4 init_print_obj_doctext (spec, unused)
*5 "print.doc.indivtext"
*5 "print.doc.textloc_file"
*5 "print.doc.textloc_file.rmode"
*5 "print.trace"
*4 close_print_obj_text (inst)
*6 global_start,global_end used to indicate what range of docs will be printed
*7 The textloc relation "in_file" (if not VALID_FILE, then use textloc_file),
*7 will be used to print all doc texts in that file (modulo global_start,
*7 global_end). Text output to go into file "out_file" (if not VALID_FILE,
*7 then stdout).
*8 Procedure indivtext gives format of doc text output.
when using docsmart will appear as
DESCRIPTION:
print document texts
PROCEDURE:
print_obj_text (in_file, out_file, inst)
char *in_file;
char *out_file;
int inst;
init_print_obj_doctext (spec, unused)
close_print_obj_text (inst)
HIERARCHY:
print.obj.doctext
USES:
"print.doc.indivtext"
"print.doc.textloc_file"
"print.doc.textloc_file.rmode"
"print.trace"
global_start,global_end used to indicate what range of docs will be printed
FULL DESCRIPTION:
The textloc relation "in_file" (if not VALID_FILE, then use textloc_file),
will be used to print all doc texts in that file (modulo global_start,
global_end). Text output to go into file "out_file" (if not VALID_FILE,
then stdout).
ALGORITHM:
Procedure indivtext gives format of doc text output.
BUGS AND WARNINGS:
which, if not any more understandable, is at least prettier! The
actual specification pair giving that is
print.format "DESCRIPTION:\n%d\nPROCEDURE:\n%m%p%r\n HIERARCHY:\n%h\nUSES:\n%s%g\nFULL DESCRIPTION:\n%f\n ALGORITHM:\n%a\nBUGS AND WARNINGS:\n%b"
where the %x construct says to include the preparsed document section with name 'x' in the output at this point. (Only the first letter of section names is significant).
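The %x substitution can be pictured with a short Python sketch. This is an illustration of the idea only, not the SMART print code: expand_format is a hypothetical name, and treating the two-character sequence \n in the spec value as a newline is an assumption.

```python
import re


def expand_format(fmt, sections):
    """Replace each %x in fmt with the text of the section named x.

    Only the first letter of a section name is significant, so sections
    are keyed by that letter. Unknown letters expand to nothing.
    """
    by_letter = {}
    for name, text in sections.items():
        if name:
            by_letter.setdefault(name[0], text)
    # Assumption: "\n" written in the spec value stands for a newline.
    fmt = fmt.replace("\\n", "\n")
    return re.sub(r"%([A-Za-z])",
                  lambda m: by_letter.get(m.group(1), ""),
                  fmt)
```

Called with a format of "DESCRIPTION:\n%d\nHIERARCHY:\n%h" and sections named "description" and "hierarchy", it produces the two headings, each followed by its section text.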
Disk based access is given by specification values
rmode    SRDONLY
rwmode   SRDWR
rwcmode  SRDWR|SCREATE
Memory based access is given by
rmode    SRDONLY|SINCORE
rwmode   SRDWR|SINCORE
rwcmode  SRDWR|SCREATE|SINCORE
Mmap access is given by
rmode    SRDONLY|SMMAP
rwmode   SRDWR
rwcmode  SRDWR|SCREATE
(Mmap access is currently only implemented for smart read operations.)
If there is a specific file that you want to access in a different fashion than these defaults, that could be done
query.textloc_file.rwmode  SRDWR|SINCORE
for example. Another parameter set that may be given in the spec file is advice about memory and swap space available for the program. Particular procedures (notably vec_inv) that want to use a lot of virtual memory can refer to these values as a guide. The defaults for vec_inv are set
vec_inv.mem_usage       4194000
vec_inv.virt_mem_usage  50000000
basically saying it should try to limit its own resident set size to about 4 Mbytes, and should allocate no more than 50 Mbytes of memory altogether. (For a large collection, this means vec_inv would have to write out intermediate lists to disk).
A final collection dependent parameter that could be set in the spec file is the size of the basic dictionary hash file. Again, this setting is only a question of efficiency, but an important question. Too small, and dictionary accesses become slow as overflow procedures have to be used. Too large, and space is wasted both in the dictionary and in the inverted file. Ideal is to end up with a dictionary with a bit more than half of its entries filled. (You can tell the state of the dictionary by
smprint rel_header dict_file_name
)
The default setting is reasonable for standard information retrieval test collections (~30000), but is too small for larger collections. Depending on exactly what is being indexed (eg, mail sources and destinations in an electronic news collection), values up to 500,000 (or larger) may be reasonable. Eventually, the basic implementation of dictionaries will change to something more reasonable. In the meantime, this particular hash-based approach works and is fast but is inflexible.
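The "a bit more than half filled" guideline is simple arithmetic, sketched here in Python. The 60 percent target is an assumption chosen to stand for "a bit more than half"; fill_percent and suggest_dict_size are hypothetical helpers, not part of SMART.

```python
def fill_percent(entries_used, dict_size):
    """Whole-number percentage of dictionary slots in use."""
    return 100 * entries_used // dict_size


def suggest_dict_size(expected_entries, target_fill_percent=60):
    """Smallest size whose fill does not exceed the target.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    return (expected_entries * 100 + target_fill_percent - 1) // target_fill_percent
```

Under that assumption, a collection expected to produce about 300,000 distinct dictionary entries would call for a dict_file_size of roughly 500,000, which is in line with the upper values mentioned above.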
The short-term solution is simply to copy these files from time to time, using the smart access procedures. The command
smart convert /spec proc convert.obj.inv_inv in f1 out f2
will copy inverted file object f1 to f2. Similarly,
smart convert /spec proc convert.obj.vec_vec in f1 out f2
smart convert /spec proc convert.obj.textloc_textloc in f1 out f2
will copy a document vector file and a textloc file. The copied files then have to be moved back manually. All of this can be done automatically within SMART; I just haven't gotten around to writing the half page routine that will do it. In the mid to long term, more attention has to be paid to collection maintenance commands.
smart index.doc /spec doc_loc new_doc_list
will add those documents.
# Copy inverted file, removing deleted docs
smart convert spec proc convert.obj.inv_inv \
in $database/inv.nnc out $database/inv.new \
deleted_doc_file $temp
# Copy doc file, removing deleted docs (not needed for news)
smart convert spec proc convert.obj.vec_vec \
in $database/doc.nnn out $database/doc.new \
deleted_doc_file $temp
# Copy textloc file, removing deleted docs
smart convert spec proc convert.obj.textloc_textloc \
in $database/textloc out $database/textloc.new \
deleted_doc_file $temp
# Warning, the following operation may interfere with users already
# running a retrieval (everything above should not).
/bin/mv $database/textloc.new $database/textloc
/bin/mv $database/textloc.new.var $database/textloc.var
/bin/mv $database/doc.new $database/doc.nnc
/bin/mv $database/doc.new.var $database/doc.nnc.var
/bin/mv $database/inv.new $database/inv.nnc
/bin/mv $database/inv.new.var $database/inv.nnc.var
Note that copying the files like this has the side benefit of
compacting the files. Also note the warning above. There is
currently no code within SMART to prevent somebody from running a
retrieval at the same time the copied files are being moved around.
That could be a problem.
Actually constructing the file $temp is done in a collection dependent fashion. If you want to remove documents from SMART whose text form has been removed, the following command will work
# find which documents no longer have text files
smart print $database/spec proc print.obj.did_nonvalid out $temp
Otherwise, you're pretty much on your own. See smart/Sample/update_news for a full updating script.
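If the list of deleted documents has to be built outside SMART, the underlying check is just "does the text file for this document still exist". The Python sketch below shows that check on its own. It is not the print.obj.did_nonvalid procedure; ids_with_missing_text is a hypothetical helper, and it assumes you already have (document id, text file path) pairs from somewhere. The format SMART expects in the deleted_doc_file is not covered here.

```python
import os


def ids_with_missing_text(textloc_entries):
    """Return ids of documents whose text file is no longer on disk.

    textloc_entries: iterable of (doc_id, path) pairs (assumed input shape).
    """
    return [doc_id for doc_id, path in textloc_entries
            if not os.path.exists(path)]
```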