For space and time efficiency, most of SMART data files are stored in binary form. Thus they need special procedures / programs to be able to examine the contents of the files. All SMART binary data files are kept as relational objects, with the type of the relational object being needed in order to look at it. The full invocation of smart to print the tuples of object file OBJECT from database DATABASE of relational type TYPE which fall in the range STARTID, ENDID is
smart print DATABASE/spec proc print.obj.TYPE in OBJECT
global_start STARTID global_end ENDID
Luckily for those with normal memories, in most cases this can be
abbreviated
smprint -b STARTID -e ENDID -s SPEC TYPE OBJECT
with the startid and endid arguments being optional in both full
and abbreviated forms (Ie.
smprint TYPE OBJECT
prints the entire object OBJECT.)
There are two special arguments of smprint that deserve mention here. One is
smprint rel_header OBJECT
This gives the relational header for a SMART object of any type.
For example, smprint rel_header .../smart/src/text_adi/indexed/dict gives
num_entries 828 max_primary_value 3563 max_secondary_value 0 type 96 data_type 3 _entry_size 12 _internal 0of which the num_entries field and max_primary_value field are often of interest.
The last form of smprint is
smprint proc PREFIX
which doesn't really fit in with the other forms, but is so
useful that an exception was made. Instead of taking a SMART
object as the file to be printed, it takes the smart procedure
hierarchy, and prints out all the hierarchy names beginning with
PREFIX. Eg
smprint proc print.obj
will give all of the hierarchy names that begin with "print.obj", ie,
all legal values for TYPE.
In this write-up, the indexed adi database in .../src/test_adi/indexed is used as an example, with experimental runs from .../src/test_adi/test also being looked at.
For the adi collection in .../src/test_adi/indexed, the files below exist. Each file is given its type, a brief description of function, and then what output from smprint looks like
common_words DICT hash table of stop words not to be indexed
Concept Info Token
22 0 goes
dict DICT Mapping from token to concept number
Concept Info Token
2 0 zip
doc.atc VEC } Two file single relational object giving
doc.atc.var } indexed vectors with "atc" (tfidf-norm) weights.
Docid Ctype Concept Weight
1 0 15 0.089209
doc.nnn VEC Same as above, but with only tf weights
doc.nnn.var
Docid Ctype Concept Weight
1 0 15 1.000000
inv.atc INV Two file single relational object giving
inv.atc.var inverted file version of document.atc
Docid Ctype Concept Weight
42 0 2 0.250600
inv.nnn INV Same as above but for document.nnn
inv.nnn.var
Docid Ctype Concept Weight
42 0 2 1.000000
doc_loc --- Text giving location of documents to be indexed
q_textloc TEXTLOC Location information for each indexed query
q_textloc.var
Doc textloc 1
Title ''
File '/home/smart/src/test_adi/coll/query.text'
0 248
4
qrels RR_VEC Relevance judgments
qid did rank sim
1 17 0 0.000000
query.atc VEC Query vector set indexed using atc weights
query.atc.var
Queryid Ctype Concept Weight
1 0 19 0.372417
query.nnn VEC Same as above, but using only tf weights.
query.nnn.var
Queryid Ctype Concept Weight
1 0 19 1.000000
query_loc --- Text giving location of queries to be indexed
spec --- Text describing this collection
textloc TEXTLOC Location and title information for each
textloc.var indexed document.
Doc textloc 1
Title 'the ibm dsd technical information center - a total systems...'
File '/home/smart/src/test_adi/coll/adi.all'
0 1303
4
A typical real-life (non-experimental) indexed SMART collection may
only contain the files
smart print spec proc print.obj.doctext in textloc \
global_start 10 global_end 10
or
smprint -s spec -b 10 -e 10 doctext textloc
which prints the text of document 10 in whatever standard document
format is given in the spec file.
smart print spec proc print.obj.querytext in q_textloc \
global_start 10 global_end 10
or
smprint -s spec -b 10 -e 10 querytext q_textloc
which prints query 10.
smart print spec proc print.obj.vec_dict in query.nnn \
global_start 1 global_end 1
or
smprint -s spec -b 1 -e 1 vec_dict query.nnn
prints query vector 1 in the format
Queryid Ctype Concept Weight Token
1 0 19 1.00000 difficult
smprint tr_vec tr.atc
gives, for each of the documents returned to a user,
qid did rank rel action iter sim
1 4 15 0 0 0 0.065027
where "rel" gives the relevance of the document, "action" gives
what the user has done with the document, and "iter" gives the
feedback iteration of the run that retrieved this document.
(All examples in this section assume you are in
.../src/test_adi/test).
The other format gives information about the ranks of the relevant documents. This is the base for normal experimental evaluation methods. This is the RR_VEC format, giving
qid did rank sim
1 17 7 0.074443
for each of the relevant query,document pairs seen in this retrieval.
These two formats are the raw information produced by an experimental run. Either format can be passed through an evaluation process to produce numbers that can be compared across retrieval runs.
smart print spec.atc proc print.obj.rr_eval in spec.atc
or
smprint rr_eval spec.atc
will evaluate the run spec.atc using the ranks of the
relevant documents. An explanation of the output is found
in Doc/overview/app.eval.
The evaluation figures using RR_VEC information are more accurate than those using just TR_VEC information. But there are times when only TR_VEC information is available. For that reason,
smart print spec.atc proc print.obj.tr_eval in spec.atc
or
smprint tr_eval spec.atc
will also produce evaluation output.
In general, it's much handier to have head-to-head comparisons of more than one retrieval run. The parameter "in" can have a list of specification files as its value, separated by whitespace. Thus
smart print spec.atc proc print.obj.rr_eval in "spec.nnn spec.atc"
or
smprint rr_eval "spec.nnn spec.atc"
will produce
Relevant ranked evaluation
1. Doc weight == Query weight == nnn (term-freq)
2. Doc weight == Query weight == atc (augmented tfidf)
Run number: 1 2
Num_queries: 35 35
Total number of documents over all queries
Retrieved: 525 525
Relevant: 170 170
Rel_ret: 80 92
Trunc_ret: 476 457
Recall - Precision Averages:
at 0.00 0.4824 0.6726
at 0.10 0.4824 0.6726
at 0.20 0.4343 0.6261
at 0.30 0.4203 0.5785
at 0.40 0.3640 0.5146
at 0.50 0.3405 0.4998
at 0.60 0.2775 0.3799
at 0.70 0.2247 0.2928
at 0.80 0.2143 0.2652
at 0.90 0.1874 0.2380
at 1.00 0.1863 0.2361
Average precision for all points
11-pt Avg: 0.3286 0.4524
% Change: 37.7
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
3-pt Avg: 0.3297 0.4637
% Change: 40.6
Recall:
Exact: 0.5036 0.6138
at 5 docs: 0.2718 0.4002
at 10 docs: 0.3915 0.5166
at 15 docs: 0.5036 0.6138
at 30 docs: 0.6675 0.7750
Precision:
Exact: 0.1524 0.1752
At 5 docs: 0.2457 0.3086
At 10 docs: 0.1686 0.2229
At 15 docs: 0.1524 0.1752
At 30 docs: 0.1048 0.1162
Truncated Precision:
Exact: 0.2064 0.2622
At 5 docs: 0.2629 0.3571
At 10 docs: 0.2061 0.2910
At 15 docs: 0.2064 0.2622
At 30 docs: 0.1896 0.2427