Storing files in a PostgreSQL database
This article extends an earlier article about reasons to store user files in a database rather than in a filesystem, and describes several ways to do that in PostgreSQL. There are BLOBs, BYTEA columns and custom solutions, and each of them has its advantages and disadvantages, so let's investigate them.
Well, the first solution you'll probably come up with is using BYTEA columns. After all, these columns were designed for storing binary data, right? Yes, and it's a very efficient solution in most cases - just create a table, for example
CREATE TABLE Files (
    id       SERIAL PRIMARY KEY,
    filename VARCHAR(64) NOT NULL,
    filesize INT NOT NULL,
    data     BYTEA NOT NULL,
    created  TIMESTAMP NOT NULL,
    ... other metadata ...
);
and then insert a file with all the data:
INSERT INTO Files VALUES (1, 'test_file.txt', 10, 'abcdefghij', ...);
Sure, this works ... unless you need to store large files, where "large" means exceeding the amount of memory you have. This is caused by the fact that you have to assemble the whole SQL INSERT command with all the file data, which may consume a lot of memory (and the need to escape the BYTEA data makes this even worse).
For example, PHP has a memory_limit option, usually set to 8MB (which is quite a reasonable size), and you'll exceed this limit even with a 2MB file (you have to escape the data before using it in the INSERT, and the escaped copy multiplies the memory needed). You may increase the memory_limit value a little, but on a shared server the admin won't be very happy about that (and I completely understand his reasons).
A very messy workaround for this is to insert the data in "chunks" - use only the first 256kB in the first INSERT, and then append the remaining data using UPDATEs with 256kB chunks - something like this:
INSERT INTO Files VALUES (1, 'test_file.txt', 10, <first 256kB of data>, ...);
FOR c IN remaining chunks LOOP
    UPDATE Files SET data = data || <next 256kB of data> WHERE id = 1;
END LOOP;
But while this effectively solves the memory exhaustion problem, it creates a lot of dead tuples in the database (each UPDATE leaves behind a complete old version of the row, which the autovacuum daemon will have to clean up).
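To make the chunked approach concrete, here is a minimal Python sketch that only generates the statements - the column list and statement shapes are illustrative, and real code should hand the chunks to the driver as parameters instead of building SQL strings (the '\x...' hex format shown is PostgreSQL's bytea hex escaping):

```python
import binascii

CHUNK = 256 * 1024  # 256kB per statement, as in the text

def bytea_hex(data):
    # PostgreSQL bytea "hex" input format: '\x...'
    return "'\\x" + binascii.hexlify(data).decode() + "'"

def chunked_statements(file_id, filename, data):
    """Yield one INSERT with the first chunk, then UPDATEs appending
    the rest, so no single statement embeds more than CHUNK bytes."""
    first, rest = data[:CHUNK], data[CHUNK:]
    yield ("INSERT INTO Files (id, filename, filesize, data, created) "
           "VALUES (%d, '%s', %d, %s, now());"
           % (file_id, filename, len(data), bytea_hex(first)))
    for off in range(0, len(rest), CHUNK):
        yield ("UPDATE Files SET data = data || %s WHERE id = %d;"
               % (bytea_hex(rest[off:off + CHUNK]), file_id))

# A 600kB payload yields 1 INSERT followed by 2 UPDATEs
stmts = list(chunked_statements(1, 'test_file.txt', b'x' * (600 * 1024)))
```

The point is only that each statement carries at most 256kB of data - the dead-tuple problem described above remains.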
When reading the file from a database (e.g. in case of a download), you have to be careful about the memory limit too - if the file is small you may read it directly (and unescape it), but if the file is large, you have to read it piece by piece (using a substr function on the data column).
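Reading works the same way in reverse. This hypothetical sketch only generates the SELECTs that would fetch a file of known size piece by piece - note that substr() on BYTEA counts bytes from 1, not 0:

```python
CHUNK = 256 * 1024  # read 256kB per query

def chunked_selects(file_id, filesize):
    """Yield one SELECT per piece; substr() on BYTEA is 1-based."""
    for start in range(1, filesize + 1, CHUNK):
        yield ("SELECT substr(data, %d, %d) FROM Files WHERE id = %d;"
               % (start, CHUNK, file_id))

# A 600kB file is read in three pieces, starting at bytes 1, 262145 and 524289
queries = list(chunked_selects(1, 600 * 1024))
```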
Conclusion: If you need to store just small files (and you know there won't be any large files), you may safely use BYTEA columns. If most of the files will be small but occasionally you'll need to store a large file, you may still use a BYTEA column with the UPDATE hack described above. But if there will be a lot of large files, BYTEA columns are not the best solution I guess.
Another PostgreSQL feature used for storing binary data is the BLOB - binary large object. BLOB is not a column type - these objects are stored outside of your table (in the pg_largeobject system table) and referenced by an OID. PostgreSQL provides functions to manipulate BLOBs - lo_create, lo_open, lo_close, lo_import, lo_export etc. (see the PostgreSQL documentation for details).
Moreover, PHP's PDO driver for PostgreSQL supports large objects, so it's possible to create BLOBs directly from PHP - see the example here. That means you can create a BLOB (and remember its OID), upload all the data to the BLOB, and then use the OID in a table to reference the BLOB. You may actually create the BLOBs automatically when inserting into the main table.
What's so nice about BLOBs? There are no difficulties with memory limits, because the data are streamed to the server (notice the stream_copy_to_stream function in the examples) - just open the file as a stream, and let the system copy the data to the server.
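The streaming idea itself is driver-independent and worth spelling out. In this sketch an io.BytesIO object stands in for the server-side large-object handle (in real code it would be the stream your driver hands you, e.g. from PDO's large-object support) - the essence is that only one small buffer is ever in memory:

```python
import io

def copy_stream(src, dst, bufsize=64 * 1024):
    """Copy src to dst in fixed-size pieces - the essence of
    stream_copy_to_stream: memory use is bounded by bufsize."""
    total = 0
    while True:
        buf = src.read(bufsize)
        if not buf:
            return total
        dst.write(buf)
        total += len(buf)

# Stand-ins: src would be the uploaded file, dst the large-object stream
src = io.BytesIO(b'a' * (300 * 1024))
dst = io.BytesIO()
copied = copy_stream(src, dst)
```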
What's not so nice about BLOBs? There's no referential integrity - you may delete a BLOB that is still referenced from a table. You may create AFTER UPDATE and AFTER DELETE triggers to remove orphaned BLOBs, or use the "lo" contrib module. Anyway, none of these solutions enforces referential integrity - some looney may still delete the BLOB directly - and that's why I finally came up with the following solution.
my solution ...
When designing a "perfect" solution for storing files, I've defined the following desired features:
- properly enforced referential integrity
- low memory requirements (no problems with memory limits)
- support for streaming (in ideal case)
And I've come up with a solution based on two tables - "Files" and "File_data".
Table "Files" contains information about files (metadata) but not the data or any reference to filesystem or BLOB:
CREATE TABLE Files (
    id       SERIAL PRIMARY KEY,
    filename VARCHAR(64) NOT NULL,
    filesize INT NOT NULL,
    created  TIMESTAMP NOT NULL,
    ... other metadata ...
);
The data of the file are stored in the "File_data" table, split into segments of arbitrary length:
CREATE TABLE File_data (
    file_id INT,
    segment INT,
    data    BYTEA,
    PRIMARY KEY (file_id, segment)
);
ALTER TABLE File_data ADD FOREIGN KEY (file_id) REFERENCES Files(id);
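Inserting a file then means one row in Files plus one File_data row per segment. A hypothetical Python sketch of the splitting (the generated statements are only illustrative - real code should use parameterized queries and the driver's bytea handling):

```python
import binascii

SEGMENT = 8 * 1024  # 8kB segments

def segment_rows(file_id, data):
    """Split a payload into numbered segments for the File_data table."""
    return [(file_id, n, data[off:off + SEGMENT])
            for n, off in enumerate(range(0, len(data), SEGMENT))]

def insert_statements(file_id, data):
    for fid, seg, chunk in segment_rows(file_id, data):
        yield ("INSERT INTO File_data (file_id, segment, data) "
               "VALUES (%d, %d, '\\x%s');"
               % (fid, seg, binascii.hexlify(chunk).decode()))

rows = segment_rows(1, b'x' * 20000)  # 20000 bytes -> 3 segments
```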
Regarding the desired features mentioned above, the first one (properly enforced referential integrity) is obviously met thanks to the foreign key. The second (low memory requirements) depends solely on the segment size, as the segment size implies the amount of memory required - an 8kB segment needs only about 32kB of memory (the segment itself plus its escaped copies).
The last requirement (support for streaming) is not met - at least not literally, as it is not possible to open the file as a stream (as is possible with a BLOB). But with sufficiently small segments the effect is close enough, since the file is processed piece by piece anyway.
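For example, a download handler can emit the file by walking the segments in order. Here the table is simulated by an in-memory list of (segment, data) pairs - in real code each chunk would come from something like SELECT data FROM File_data WHERE file_id = ... ORDER BY segment:

```python
def stream_file(segments):
    """Yield a file's data segment by segment, in order - only one
    segment is held in memory at a time, which mimics streaming."""
    for _, chunk in sorted(segments):  # sort by segment number
        yield chunk

# Simulated File_data rows: (segment, data), deliberately out of order
table = [(1, b'defgh'), (0, b'abc'), (2, b'ij')]
restored = b''.join(stream_file(table))
```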