Fulltext with dictionaries in shared memory
If you're using the fulltext built into PostgreSQL and can't use a Snowball-based solution because of the nature of the language (Snowball works great for English, but there's nothing similar for Czech with reasonable accuracy, and there probably never will be), you're more or less forced to use ispell-based dictionaries. In that case you've probably noticed two annoying features.
For each connection (backend), the dictionaries are loaded into private memory, i.e. each connection spends a lot of CPU time parsing the dictionary files on its first query, and it needs a lot of memory to store the same information. If the parsed dictionary needs 25 MB (and that's not exceptional) and you have 20 concurrent connections using the fulltext, you've suddenly lost 500 MB of RAM. That may seem like a negligible amount of RAM nowadays, but there are environments where this is significant (e.g. VPS or some cloud instances).
There are workarounds - e.g. using a connection pool with already initialized connections (you'll skip the initialization time, but you're still wasting memory) or keeping a small number of persistent connections just for fulltext queries (but that's inconvenient to work with). There are probably other solutions, but none of them is perfect ...
Recent issues with the fulltext forced me to write an extension that allows storing the dictionaries in shared memory. Even this solution is not perfect (more on that later), but it's definitely a step in the right direction. So how does it work and what does it do?
This extension is very fresh - I developed it less than two days ago, so it's not as well tested as it should be. It's therefore not something you'd want to put into production right now, but if you can, give it a try in your testing environment, with your dictionaries etc., and let me know about any issues (crashes, differences compared to plain ispell etc.).
The basic functionality is quite simple - during the database startup, the extension requests space in the shared segment (next to the shared buffers) to store the dictionaries. The amount of memory is determined by a GUC variable shared_ispell.max_size (the default size is 30MB).
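To check the setup from a running session, something like the following should do. This is only a sketch: the text above says the memory is requested during startup, so I'm assuming the library has to be listed in shared_preload_libraries (a standard PostgreSQL setting), which the post itself does not state.

```sql
-- Assumption: the extension library is preloaded so it can reserve shared
-- memory at startup; this shows what the server actually preloads.
SHOW shared_preload_libraries;

-- Size of the segment reserved for the dictionaries
-- (30MB by default, according to the description above).
SHOW shared_ispell.max_size;
```

Since the segment is allocated at startup, changing the size presumably means editing the configuration file and restarting the server.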
The extension defines a text search template called "shared_ispell" that may be used to define custom dictionaries like this:
CREATE TEXT SEARCH DICTIONARY czech_shared (
    TEMPLATE = shared_ispell,
    DictFile = czech,
    AffFile = czech,
    StopWords = czech
);

CREATE TEXT SEARCH CONFIGURATION public.czech_shared
    ( COPY = pg_catalog.simple );

ALTER TEXT SEARCH CONFIGURATION czech_shared
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH czech_shared;
Now you can use the config "czech_shared" just like the other ispell configurations, e.g.:
SELECT to_tsquery('czech_shared', 'hledané & výrazy');
All the beauty is hidden in the template - it works similarly to "ispell", but instead of loading the dictionaries into the private memory of each backend, it looks for them in the shared segment. If the dictionary is not loaded yet, the backend loads it just like ispell would and then copies it into the shared segment so that the other backends can use it too.
Segment size
One of the issues is how to determine the segment size - it's a static value (it can't be modified at runtime) and it's not known in advance which dictionaries will be loaded later, not to mention their size. If you know what languages you're going to use, you may set a large limit, load all the dictionaries and then see how much memory is actually needed (using the shared_ispell_mem_used() function). If you don't know the list of dictionaries, you'll have to experiment.
It's not possible to know how much memory a dictionary will need without actually parsing it, and the current implementation simply attempts to copy the parsed result, so it may find out halfway through that there's not enough memory. In that case it simply throws an error saying there's not enough space in the shared segment.
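The sizing experiment might look roughly like this. It's a sketch that assumes the czech_shared dictionary defined earlier and uses the standard ts_lexize() function to trigger the load; the token 'slovo' is arbitrary, and the exact output of shared_ispell_mem_used() is not described here, so none is shown.

```sql
-- The first lookup parses the dictionary files and copies the result
-- into the shared segment (or fails if the segment is too small).
SELECT ts_lexize('czech_shared', 'slovo');

-- See how much of the shared segment is in use after loading.
SELECT shared_ispell_mem_used();
```

Repeat the lookup for every dictionary you plan to use, then set shared_ispell.max_size somewhat above the reported value.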
It's not difficult to remove the dictionaries from memory - that's what shared_ispell_reset() function does. It allows you to start over, e.g. to load modified dictionary files from the disk and so on (existing sessions won't fail, the first request will reload the necessary dictionaries).
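For example, after editing the dictionary files on disk, a reload could be sketched like this (again assuming the czech_shared dictionary from above; whether the call needs special privileges is not covered here):

```sql
-- Drop everything currently stored in the shared segment.
SELECT shared_ispell_reset();

-- The next lookup loads the (possibly modified) files again.
SELECT ts_lexize('czech_shared', 'slovo');
```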
Splitting into parts
Another interesting option is to split the dictionary into parts (dictionary, affixes and stop words). Those are loaded separately and are not tightly bound together, so there's no reason not to use one dictionary file (DictFile) with multiple affix files (AffFile) or stop word lists. In the traditional implementation, this means separate initialization and (AFAIK) keeping multiple copies of the same file in memory, because each dictionary simply loads its own copy.
My plan is to share the files - if you define two dictionaries sharing files, e.g.:
CREATE TEXT SEARCH DICTIONARY czech_shared_1 (
    TEMPLATE = shared_ispell,
    DictFile = czech,
    AffFile = czech,
    StopWords = czech_stop_1
);

CREATE TEXT SEARCH DICTIONARY czech_shared_2 (
    TEMPLATE = shared_ispell,
    DictFile = czech,
    AffFile = czech,
    StopWords = czech_stop_2
);
the shared ones will be loaded just once (in this case DictFile and AffFile). Right now this works nicely for StopWords, and it might be interesting for DictFile/AffFile too.
Update 5/1/2012: I've realized the dictionary and affixes are actually tightly bound, so it's not possible to load them separately.




