TDM_UPDATE renews a text collection by updating the
corresponding term-document matrix.
A = TDM_UPDATE(FILENAME, UPDATE_STRUCT) returns the new
term-document matrix of the updated collection. FILENAME
defines the file (or files, if a directory is supplied)
containing the new documents, while UPDATE_STRUCT defines
the update structure returned by TMG. If FILENAME is empty,
the collection is simply updated using the options defined
by UPDATE_STRUCT (for example, to apply another
term-weighting scheme).
[A, DICTIONARY] = TDM_UPDATE(FILENAME, UPDATE_STRUCT)
also returns the dictionary for the updated collection,
while [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS]
= TDM_UPDATE(FILENAME, UPDATE_STRUCT) returns the vector
of global weights for the dictionary and the normalization
factor for each document, if such a factor is used.
If normalization is not used, TDM_UPDATE returns a vector
of all ones.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS,
WORDS_PER_DOC] = TDM_UPDATE(FILENAME, UPDATE_STRUCT) also
returns per-document statistics, i.e. the number of terms in
each document.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS,
WORDS_PER_DOC, TITLES, FILES] = TDM_UPDATE(FILENAME,
UPDATE_STRUCT) returns in FILES the filenames contained in
the directory (or file) FILENAME and in TITLES a cell array
that contains a declaratory title for each document, as well
as the document's first line.
Finally, [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS,
WORDS_PER_DOC, TITLES, FILES, UPDATE_STRUCT] =
TDM_UPDATE(FILENAME, UPDATE_STRUCT) returns the update
structure that keeps the essential information for the
collection's update (or downdate).
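Example (a sketch, not taken from the TMG distribution): the
full-output call form might be used as shown here. It assumes
that the function is callable in lowercase as tdm_update, that
TMG is on the MATLAB path, and that an update structure from an
earlier TMG run already exists in the workspace under the name
update_struct. The directory name 'new_docs' is hypothetical.

```matlab
% Hypothetical directory holding the documents to be added.
new_docs_dir = 'new_docs';

% Run only if TMG is on the path and an update structure from a
% previous TMG run is available in the workspace (both assumed).
can_run = exist('tdm_update', 'file') == 2 && ...
          exist('update_struct', 'var') == 1;

if can_run
    % Request every output, including the refreshed update structure,
    % so that later updates (or downdates) can start from it.
    [A, dictionary, global_weights, normalization_factors, ...
        words_per_doc, titles, files, update_struct] = ...
        tdm_update(new_docs_dir, update_struct);
end
```

Overwriting update_struct with the returned structure keeps the
workspace copy in step with the updated collection.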
TDM_UPDATE(FILENAME, UPDATE_STRUCT, OPTIONS) defines optional
parameters:
- OPTIONS.delimiter: The delimiter between documents within
the same file. Possible values are 'emptyline' (default),
'none_delimiter' (treats each file as a single document)
or any other string.
- OPTIONS.line_delimiter: Defines whether the delimiter occupies
a whole line of text (default, 1) or not.
- OPTIONS.update_step: The step used for the incremental
build of the inverted index (default 10,000).
- OPTIONS.dsp: Displays results in the command window
(default 1) or not (0).
- OPTIONS.remove_num: Indicates whether numbers are removed from the
dictionary (value 1) or not (value 0, default).
- OPTIONS.remove_al: Indicates whether alphanumerics are removed from
the dictionary (value 1) or not (value 0, default).
- OPTIONS.parse_subd: Indicates whether all subdirectories are parsed
without prompting (value 1), or the user is asked which
subdirectories to parse (value 0, default). Setting this option
to 1 is recommended for large collections with many subdirectories,
so that they can be processed in batch mode without any questions
during parsing.
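Example (a sketch under stated assumptions): the OPTIONS fields
listed above can be collected in an ordinary MATLAB struct before
the call. The field names and default values are those given in
this help text; the lowercase function name tdm_update, the
directory name 'new_docs', and the presence of update_struct in
the workspace are assumptions.

```matlab
% Build the options structure using the field names documented above.
options = struct();
options.delimiter      = 'emptyline';  % documents separated by empty lines
options.line_delimiter = 1;            % delimiter takes a whole line
options.update_step    = 10000;        % step for the inverted index build
options.dsp            = 0;            % do not print results
options.remove_num     = 1;            % drop numbers from the dictionary
options.remove_al      = 0;            % keep alphanumerics (default)
options.parse_subd     = 1;            % parse subdirectories without asking

% Call TDM_UPDATE only if TMG and an update structure are available
% (both assumed, not provided by this sketch).
if exist('tdm_update', 'file') == 2 && exist('update_struct', 'var') == 1
    [A, dictionary] = tdm_update('new_docs', update_struct, options);
end
```

Setting parse_subd to 1 and dsp to 0 suits unattended batch runs,
since no questions are asked and nothing is printed.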
Copyright 2011 Dimitrios Zeimpekis, Eugenia Maria Kontopoulou,
Efstratios Gallopoulos