Storing files in a PostgreSQL database

This article extends a previous article about reasons to store user files in a database rather than in a filesystem, and describes several ways to do that in PostgreSQL. There are BLOBs, BYTEA columns and custom solutions, and each of them has advantages and disadvantages, so let's investigate them.

BYTEA columns

Well, the first solution you'll probably come up with is using BYTEA columns. After all, these columns were designed for storing binary data, right? Yes, and it's a very efficient solution in most cases - just create a table, for example

CREATE TABLE Files (
    id       SERIAL      PRIMARY KEY,
    filename VARCHAR(64) NOT NULL,
    filesize INT         NOT NULL,
    data     BYTEA       NOT NULL,
    created  TIMESTAMP   NOT NULL,
    ... other metadata ...
);

and then insert a file with all the data:

INSERT INTO Files VALUES (1, 'test_file.txt', 10, 'abcdefghij', ...);

Sure, this works ... unless you need to store large files, where "large" means exceeding the amount of memory available to your application. The problem is that you have to assemble the whole SQL INSERT command, including all the file data, in memory - and the need to escape the BYTEA data makes this even worse.

For example in PHP there is a memory_limit option, usually set to 8MB (which is a quite reasonable value), and you'll exceed this limit with a 2MB file, because you have to escape the data before using it in the INSERT. You may increase the memory_limit value a little, but on a shared server the admin won't be very happy about that (and I completely understand his reasons).

A very messy workaround for this is to insert the data in "chunks" - use only the first 256kB in the first INSERT and then append the remaining data using UPDATEs with 256kB chunks - something like this:

INSERT INTO Files VALUES (1, 'test_file.txt', 10, <first 256kB of data>, ...);

FOR <each remaining 256kB chunk> LOOP
    UPDATE Files SET data = data || <256kB chunk> WHERE id = 1;
END LOOP;

But while this effectively solves the memory exhaustion problem, it creates a lot of dead tuples in the database (i.e. the autovacuum daemon will have to clean up all the obsolete versions of the row).
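
Just to illustrate the idea, here is a minimal PHP sketch of this chunked workaround. It assumes the Files table above and a hypothetical local file; the pg_* calls, the chunk size and the RETURNING clause are just one possible way to do it:

// a sketch of the chunked INSERT + UPDATE workaround
// (assumes the Files table above and a hypothetical /tmp/large_file.bin)
$con = pg_connect("host=localhost dbname=test user=test password=test");

$chunk_size = 256 * 1024;                       // 256kB chunks
$file = '/tmp/large_file.bin';
$fp = fopen($file, 'rb');

// INSERT the row with the first chunk only
$chunk = pg_escape_bytea($con, fread($fp, $chunk_size));
$res = pg_query($con, "INSERT INTO Files (filename, filesize, data, created)
                       VALUES ('large_file.bin', " . filesize($file) . ", '$chunk', now())
                       RETURNING id");
$id = pg_fetch_result($res, 0, 'id');

// append the remaining data in 256kB chunks
while (!feof($fp)) {
    $chunk = pg_escape_bytea($con, fread($fp, $chunk_size));
    pg_query($con, "UPDATE Files SET data = data || '$chunk' WHERE id = $id");
}

fclose($fp);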

When reading the file from the database (e.g. for a download), you have to be careful about the memory limit too - if the file is small you may read it directly (and unescape it), but if the file is large, you have to read it piece by piece (using the substr function on the data column).
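
Again just a sketch - reading the value piece by piece from PHP, assuming the same Files table and an arbitrary 256kB chunk size:

// a sketch of reading a large BYTEA value piece by piece using substr()
// (assumes the Files table above; the chunk size is arbitrary)
$con = pg_connect("host=localhost dbname=test user=test password=test");

$id = 1;
$chunk_size = 256 * 1024;
$offset = 1;                                    // substr() indexes from 1

header('Content-Type: application/octet-stream');

do {
    $res = pg_query($con, "SELECT substr(data, $offset, $chunk_size) AS chunk
                           FROM Files WHERE id = $id");
    $chunk = pg_unescape_bytea(pg_fetch_result($res, 0, 'chunk'));
    echo $chunk;                                // send the chunk to the client
    $offset += $chunk_size;
} while (strlen($chunk) == $chunk_size);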

Conclusion: If you need to store just small files (and you know there won't be any large ones), you may safely use BYTEA columns. If most of the files will be small but occasionally you'll need to store a large file, you may still use a BYTEA column with the UPDATE hack described above. But if there will be a lot of large files, BYTEA columns are probably not the best solution.

BLOBs

Another PostgreSQL feature usable for storing binary data is the BLOB - binary large object. A BLOB is not a column type - these objects are stored outside the table (in a system catalog) and referenced by an OID. PostgreSQL contains functions to manipulate BLOBs - lo_create, lo_open, lo_close, lo_import, lo_export etc. (see the PostgreSQL documentation for details).

Moreover, PHP's PDO (with the PostgreSQL driver) supports large objects, so it's possible to create BLOBs directly from PHP - see the example here. That means you can create a BLOB (and remember its OID), upload all the data into the BLOB, and then use the OID in a table to reference the BLOB. You may actually create the BLOBs automatically when inserting into the main table.

What's so nice about BLOBs? There are no difficulties with memory limits, because the data are streamed to the server (notice the stream_copy_to_stream function in the examples) - just open the file as a stream, and let the system copy the data to the server.
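
A minimal sketch of this PDO approach (the files_lo table, its columns and the file path are hypothetical, used only to show where the OID ends up):

// a sketch of streaming a file into a large object (BLOB) with PDO
// (the files_lo table and the /tmp path are hypothetical)
$pdo = new PDO('pgsql:host=localhost;dbname=test', 'test', 'test');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->beginTransaction();                  // large objects must be used inside a transaction

$oid = $pdo->pgsqlLOBCreate();             // create an empty large object, returns its OID
$lob = $pdo->pgsqlLOBOpen($oid, 'w');      // open it as a writable stream

$src = fopen('/tmp/large_file.bin', 'rb');
stream_copy_to_stream($src, $lob);         // stream the file data into the large object
fclose($src);
$lob = null;

// reference the new large object (by its OID) from a regular table
$stmt = $pdo->prepare('INSERT INTO files_lo (filename, data_oid) VALUES (?, ?)');
$stmt->execute(array('large_file.bin', $oid));

$pdo->commit();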

What's not so nice about BLOBs? There's no referential integrity - you may delete a BLOB that is still referenced from a table. You may create AFTER UPDATE and AFTER DELETE triggers to remove orphaned BLOBs, or use the "lo" contrib module. Anyway, none of these solutions enforces referential integrity - some looney may still delete the BLOB directly - and that's why I finally came up with the following solution.
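
For completeness, this is roughly how the "lo" contrib module can be wired up - its lo_manage trigger unlinks the large object whenever the referencing row is updated or deleted (the table and column names are again just the hypothetical ones from the previous example):

// a sketch of using the "lo" contrib module so that deleting or updating a row
// also unlinks the referenced large object (files_lo / data_oid as above)
$pdo = new PDO('pgsql:host=localhost;dbname=test', 'test', 'test');

$pdo->exec('CREATE EXTENSION IF NOT EXISTS lo');

$pdo->exec('CREATE TRIGGER t_files_lo_manage
                BEFORE UPDATE OR DELETE ON files_lo
                FOR EACH ROW EXECUTE PROCEDURE lo_manage(data_oid)');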

My solution ...

When designing a "perfect" solution for storing files, I've defined the following desired features:

  • transactional
  • properly enforced referential integrity
  • low memory requirements (no problems with memory limits)
  • support for streaming (in the ideal case)

And I've come up with a solution based on two tables - "Files" and "File_data".

Table "Files" contains information about files (metadata) but not the data or any reference to filesystem or BLOB:

CREATE TABLE Files (
    id       SERIAL      PRIMARY KEY,
    filename VARCHAR(64) NOT NULL,
    filesize INT         NOT NULL,
    created  TIMESTAMP   NOT NULL,
    ... other metadata ...
);

The data of the file are stored in the "File_data" table, split into segments of arbitrary length:

CREATE TABLE File_data (
    file_id  INT,
    segment  INT,
    data     BYTEA,
    PRIMARY KEY (file_id, segment)
);

ALTER TABLE File_data ADD FOREIGN KEY (file_id) REFERENCES Files(id);

Regarding the desired features mentioned above, the first two (transactional behaviour, referential integrity) are obviously met. The third (low memory requirements) depends solely on the segment size, as the segment size determines the amount of memory required - e.g. an 8kB segment needs about 32kB of memory (mostly for the escaped copy of the data embedded in the SQL command).

The last requirement (support for streaming) is not met - at least not literally, as it is not possible to open the file as a stream (as it is with a BLOB). But with sufficiently small segments, processing the file segment by segment gives practically the same effect.
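
To make this concrete, here is a minimal PHP sketch of writing and then reading a file with this two-table design (8kB segments, a hypothetical file name, error handling omitted):

// a sketch of storing a file into the Files / File_data tables by segments
// (8kB segments; the file path is hypothetical)
$con = pg_connect("host=localhost dbname=test user=test password=test");

$file = '/tmp/large_file.bin';
$segment_size = 8192;

pg_query($con, "BEGIN");

// insert the metadata and get the generated id
$res = pg_query($con, "INSERT INTO Files (filename, filesize, created)
                       VALUES ('large_file.bin', " . filesize($file) . ", now())
                       RETURNING id");
$id = pg_fetch_result($res, 0, 'id');

// insert the data, one segment per row
$fp = fopen($file, 'rb');
$segment = 0;
while (!feof($fp)) {
    $data = fread($fp, $segment_size);
    if ($data === '' || $data === false) break;
    $esc = pg_escape_bytea($con, $data);
    pg_query($con, "INSERT INTO File_data (file_id, segment, data)
                    VALUES ($id, $segment, '$esc')");
    $segment++;
}
fclose($fp);

pg_query($con, "COMMIT");

// reading the file back is just as simple - fetch the segments in order
$res = pg_query($con, "SELECT data FROM File_data
                       WHERE file_id = $id ORDER BY segment");
while ($row = pg_fetch_assoc($res)) {
    echo pg_unescape_bytea($row['data']);
}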

Comments

PostgreSQL/PHP issue

Hi Fuzzy,

I’ve been having a lot of trouble getting an image to INSERT & SELECT from the Postgres db.

I’m using PostgreSQL version 8.1.15 and my field is bytea for the image data.

The file opens as a string and has 2 backslashes BEFORE entering the db - when I look in the db it has only 1 backslash between the data.

And I’ve read that this is in HEX format.

So, the entering part is not the issue at the moment (unless I’ve entered it incorrectly of course), but the extraction is the main issue.

I’m using pg_unescape_bytea and it’s able to FINALLY get a string back, but I still have yet to figure out how this is converted back to a PNG image.

So, if anyone could help me I would be very grateful – as I’ve only got until Tuesday to finish this project!

My code for entering is:

$con = pg_connect("host=xxx dbname=xxx user=xxx password=xxx") or die("The connection could not be established".pg_last_error());

$data = file_get_contents('D:\\WAPPStack\\apache2\\htdocs\\images\\tester.png');
if (!$data)
{
    echo 'Could not read file.';
}

// Escape the binary data
$escaped = pg_escape_bytea($data);
echo var_dump($escaped);

// Insert it into the database
pg_query("INSERT INTO imageslo VALUES ('Testimg1', '{$escaped}')");

…which works perfectly fine and my code for extracting is:

$query = "SELECT imgstring FROM imageslo WHERE imgid='Testimg'" ;

// construct how the query is performed

$queryresult = pg_query($con, $query) ;

// fetch the results from the database for the bytea field

$image = pg_fetch_result($queryresult, 'imgstring')) ;

// get rid of the escape backslashes (or re-format the data)

$res = pg_unescape_bytea($image) ;

// Insert a header for the image

header ("Content-type: image/png") ;

// print out(echo) the image to screen

echo $res ;

I also read somewhere that I should have to convert the data back from HEX, but I haven’t a clue where to do this or if it’s necessary for version 8.1.15 – all the attempts I’ve made to use it have failed.

//echo base64_decode(file_get_contents($image['imgString'])));

Thanks in advance for any assistance!

Re: PostgreSQL/PHP issue

Hi,

in general it seems fine - doing fetch + unescape should be just fine (I've verified that and it works). The 'hex' format is irrelevant here, as that bytea output format was introduced after 8.1 (in 9.0).

I'd guess the problem is in magic_quotes_runtime (a PHP option), which may scramble the data. I don't know what PHP version / settings you are using, but this is a probable cause. Try storing the fetched data into a file and comparing it to the original one.

BTW 8.1.15 is insanely old - not only has the 8.1 branch been unsupported for a long time, there were also numerous bugfix releases in that branch (8.1.23 being the last one). I strongly recommend updating - ideally to a more recent branch.
