10 April 07

Beginner PHP Tutorial 12: Building Flat File Applications

When planning this article I considered simply reusing a similar article I wrote previously. However, I decided against it, since I have recently discovered faster and more memory-efficient approaches while working on the third version of my own PHP flat file database application, FFD.

I’ll start by teaching you how to make a simple flat file application for small amounts of data, like a guestbook. Then I’ll explain how you can develop larger applications (like a CMS or a forum) to run on flat files, using more advanced methods to gain speed and memory benefits. Hopefully this tutorial will be a good follow-up to the last one on file management in PHP.

An Overview

Before you start coding your application you need to decide on the format of your flat files and how you are going to implement the code that reads and writes them. Some formats are more memory efficient than others, some are faster and some are easier to implement, so select the format whose benefits matter most to your application. In a simple application such as a guestbook you might write a couple of functions for reading and writing data. A much larger application generally takes a different approach: a larger group of functions (generally placed within a class, but don’t worry about this as it will be explained in a later tutorial) that implement more advanced techniques and provide a more abstract interface to the file, treating it more like a database object and less like a simple text file.

The File Format/Implementation

Fortunately PHP provides two functions which make it far easier for developers to implement their own flat file applications: serialize() and unserialize(). serialize() converts a PHP value such as an array into a string, and unserialize() converts that string back into the original value. So, if you wish to store a PHP array in a file you simply call serialize() and then write the result to the file like so:

$data = array('a', 'b', 'c');
$fh = fopen('file.txt', 'w');
fwrite($fh, serialize($data));
fclose($fh);

You can then read the file back into an array using unserialize() (in just one line!). Only do this with files your own code wrote, since unserialize() is not safe to use on untrusted input:

$data = unserialize(file_get_contents('file.txt'));

This is incredibly useful since it means all you have to do is decide on the format of the array. You can easily implement tables with columns and rows by storing a two-dimensional array in this way. Here’s an example:

$data = array(
     array('column 1', 'column 2', 'column 3'),
     array('row 1', 'row 1', 'row 1'),
     array('row 2', 'row 2', 'row 2'),
     array('row 3', 'row 3', 'row 3')
);
$fh = fopen('file.txt', 'w');
fwrite($fh, serialize($data));
fclose($fh);
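
Reading such a table back is a matter of treating the first element as the list of column names. The following is only a minimal sketch under that assumption; the helper name rows_as_assoc() and the sample columns are made up for illustration:

```php
<?php
// Hypothetical helper: turn a header-plus-rows table into a list of
// associative arrays keyed by column name.
function rows_as_assoc(array $table){
     $columns = array_shift($table);     // first element holds the column names
     $rows = array();
     foreach($table as $row){
          $rows[] = array_combine($columns, $row);
     }
     return $rows;
}

$table = array(
     array('id', 'author', 'content'),
     array(1, 'Alice', 'Hello'),
     array(2, 'Bob', 'Nice site')
);

// Round trip through the serialized form, as it would be stored in the file
$stored = serialize($table);
$rows = rows_as_assoc(unserialize($stored));
echo $rows[1]['author'];   // Bob
```

Keeping the column names in the first row means each data row only stores values, which keeps the file a little smaller than repeating the keys in every row.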

A Real Life Example

This is all very nice, but you may be unclear about how it can be applied to real life applications. To show you, I’m going to build a few functions that could be used in a guestbook to add a new post, delete a post and view all posts. For each post I’m going to store the author of the message and the content of the message. Here’s the function for adding a new post:

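function add_post($file, $post_author, $post_content){
     // Start from an empty list if the file is missing or empty
     $raw = is_file($file) ? file_get_contents($file) : '';
     $data = $raw ? unserialize($raw) : array();
     $data[] = array('author' => $post_author, 'content' => $post_content);
     // LOCK_EX stops two simultaneous writes from interleaving
     file_put_contents($file, serialize($data), LOCK_EX);
}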

It’s a really simple function. It takes the file name, the name of the person who posted and the body of the post, and appends the new post to the data in the file. Here’s the function to delete a post:

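function delete_post($file, $id){
     // Treat a missing or empty file as "no posts"
     $raw = is_file($file) ? file_get_contents($file) : '';
     $data = $raw ? unserialize($raw) : array();
     if(isset($data[$id])){
          unset($data[$id]);
          // Only rewrite the file when a post was actually removed
          file_put_contents($file, serialize($data), LOCK_EX);
     }
}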

Note that the function takes a second parameter called $id. This is a number that refers to that post and no other post in the data file. For our simple application this is just the position/key of the post in the array. Because unset() does not renumber the remaining keys, deleting a post leaves the ids of the other posts unchanged. And finally, here’s the function to get all the posts:

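function get_posts($file){
     // Return an empty array when there are no posts yet
     $raw = is_file($file) ? file_get_contents($file) : '';
     return $raw ? unserialize($raw) : array();
}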

This is by far the simplest function and all it does is get the data from the file and turn it into a valid PHP array. It would then be left up to the guestbook interface code to decide how to present the information and options to the user.
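
As a rough idea of what that interface code might look like, here is a small sketch that turns an array of posts into HTML. The function name render_posts() and the markup are assumptions for illustration only. The one part that should not be skipped is escaping the user-supplied text with htmlspecialchars():

```php
<?php
// Hypothetical display helper: builds HTML for an array of posts in the
// same shape that get_posts() returns (id => array('author', 'content')).
function render_posts(array $posts){
     $html = '';
     foreach($posts as $id => $post){
          // Escape user-supplied text so it cannot inject markup
          $author  = htmlspecialchars($post['author'], ENT_QUOTES, 'UTF-8');
          $content = htmlspecialchars($post['content'], ENT_QUOTES, 'UTF-8');
          $html .= "<div class=\"post\" id=\"post-$id\">";
          $html .= "<strong>$author</strong>: $content</div>\n";
     }
     return $html;
}

// Sample data; note the gap in the keys left by a deleted post
$posts = array(
     0 => array('author' => 'Alice', 'content' => 'Hello <b>world</b>'),
     2 => array('author' => 'Bob', 'content' => 'Nice site')
);
echo render_posts($posts);
```

The array key is written into each element’s id attribute, so a delete link could pass it straight back to delete_post().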

Optimizing for Memory and Speed

Note that a lot of this section is theory and I may talk about performing operations which you may not know how to perform. If you don’t plan on using flat files for large applications you don’t need to worry about this section. Just try to follow…all will be explained…

One of the fundamental problems with most flat file solutions written in PHP is that they are limited to small data sets. In fact many people believe it is impossible for flat file solutions to deal with the vast amount of data linked to modern PHP forums and CMS solutions. The truth is that while flat file applications are unlikely ever to match the speed of MySQL, due to the overhead caused by PHP, it is perfectly possible for them to reach decent speeds and manage tens of thousands of rows effectively by implementing various optimizations.

Firstly, you need to make sure you are always dealing with as little data as possible at one time. Many flat file solutions are limited to small data sets because they read the entire data file into memory at once. If your data file contains 1000 rows, each holding a picture of about 1MB, reading the whole file means holding roughly 1GB in memory. At best this causes a massive loss of speed; more likely the script will hit PHP’s memory limit and fail. Ideally your application should only have a single row in memory at once.

So if the application using your flat file library calls your select function (or similar) to get some data from the database, you should read in the records individually, performing actions such as testing the WHERE condition on each row as it is read in. Admittedly this is slower for small tables than reading the whole file in at once, so if you wish your code can check the size of the table and pick the fastest (yet still memory efficient) way to read the data in.
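
To make the idea concrete, here is a minimal sketch of reading rows one at a time and testing a condition on each. It assumes a simple record format that is not necessarily the one FFD uses: each row is stored as a 4-byte length followed by the serialized row. The names append_row() and select_rows() are invented for this example.

```php
<?php
// Append one row as: 4-byte big-endian length, then the serialized row.
function append_row($fh, array $row){
     $blob = serialize($row);
     fwrite($fh, pack('N', strlen($blob)) . $blob);
}

// Read rows one at a time and keep only those accepted by $where.
function select_rows($fh, $where){
     $matches = array();
     rewind($fh);
     while(true){
          $header = fread($fh, 4);
          if($header === false || strlen($header) < 4){
               break;                        // end of file
          }
          $info = unpack('Nlength', $header);
          $row = unserialize(fread($fh, $info['length']));
          if($where($row)){
               $matches[] = $row;            // only matching rows are kept
          }
     }
     return $matches;
}

$fh = tmpfile();
append_row($fh, array('author' => 'Alice', 'content' => 'Hello'));
append_row($fh, array('author' => 'Bob', 'content' => "Two\nlines"));
append_row($fh, array('author' => 'Alice', 'content' => 'Back again'));

$found = select_rows($fh, function($row){ return $row['author'] === 'Alice'; });
echo count($found);   // 2
fclose($fh);
```

The length prefix is what lets the reader know where one row ends and the next begins, even when the row data itself contains newlines.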

Another problem you might encounter is that you have nowhere to store your result set except in memory. However, as I do in FFD, you can use the tmpfile() function to get a handle to a temporary file, which you can then use as a store for your result data. When you want to put a row in the result set you simply write it to the temporary file, which you can read from later when the application wants to fetch the result data.
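
A rough sketch of such a temporary result store follows. The function names result_add() and result_fetch() are hypothetical, and the one-row-per-line format (with base64 encoding so that newlines in the data cannot break a line) is an assumption made for this example, not a description of FFD:

```php
<?php
// Store a row in the result file, one line per row. base64 keeps any
// newlines inside the serialized data from breaking the line format.
function result_add($result, array $row){
     fwrite($result, base64_encode(serialize($row)) . "\n");
}

// Fetch the next row from the result file, or false when none are left.
function result_fetch($result){
     $line = fgets($result);
     if($line === false){
          return false;
     }
     return unserialize(base64_decode(rtrim($line, "\n")));
}

$result = tmpfile();               // removed automatically when closed
result_add($result, array('author' => 'Alice', 'content' => 'Hello'));
result_add($result, array('author' => 'Bob', 'content' => "Two\nlines"));

rewind($result);                   // go back to the start before fetching
while(($row = result_fetch($result)) !== false){
     echo $row['author'], "\n";
}
fclose($result);
```

Because tmpfile() deletes the file when the handle is closed, there is nothing to clean up afterwards.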

An important concept when dealing with large amounts of data is to only work with the data itself when you absolutely have to. Instead, use positions and lengths to refer to the data. For example, you could store the exact position of a row in the data file along with its length. When reading the file you can then move straight to that position using fseek() and read exactly that row by calling fread() with the length.
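
The following sketch shows the mechanics, assuming the positions and lengths are simply kept in an array while the rows are written. The helper name read_row_at() is made up for illustration:

```php
<?php
// Read one row given its byte offset and length in the data file.
function read_row_at($fh, $position, $length){
     fseek($fh, $position);
     return unserialize(fread($fh, $length));
}

$fh = tmpfile();
$positions = array();              // row number => array(position, length)
foreach(array('first', 'second', 'third') as $i => $text){
     $blob = serialize(array('content' => $text));
     $positions[$i] = array(ftell($fh), strlen($blob));
     fwrite($fh, $blob);
}

// Jump straight to the third row without reading the first two
list($position, $length) = $positions[2];
$row = read_row_at($fh, $position, $length);
echo $row['content'];   // third
fclose($fh);
```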

In FFD I have two files for each table. The first file contains all the information about the table, such as the columns and the positions of each of the rows, in PHP serialized format. The second file holds the actual data: the rows are simply placed one after another, each in PHP serialized format. To read a specific row I look in the first file for the exact position and length of the row, use them to read the row data from the second file, and then unserialize() that data to get the row. While the process takes a while to explain and can be quite complicated, the computer can do it quickly, and at the same time save memory by not having to read in the entire table, which could contain thousands of rows.
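
Here is a simplified sketch of that two-file layout. It is not FFD’s actual code: the function names, the file names and the shape of the first file (just a list of positions and lengths, with no column information) are all assumptions made to keep the example short.

```php
<?php
// Append a row to the data file and record its position in the meta file.
function table_insert($meta_file, $data_file, array $row){
     $meta = is_file($meta_file)
          ? unserialize(file_get_contents($meta_file))
          : array();
     $blob = serialize($row);

     $fh = fopen($data_file, 'c+');      // create if missing, never truncate
     fseek($fh, 0, SEEK_END);
     $meta[] = array('position' => ftell($fh), 'length' => strlen($blob));
     fwrite($fh, $blob);
     fclose($fh);

     file_put_contents($meta_file, serialize($meta), LOCK_EX);
     return count($meta) - 1;            // id of the new row
}

// Read a single row by id; only that row's bytes are read from the data file.
function table_read($meta_file, $data_file, $id){
     $meta = unserialize(file_get_contents($meta_file));
     if(!isset($meta[$id])){
          return false;
     }
     $fh = fopen($data_file, 'r');
     fseek($fh, $meta[$id]['position']);
     $row = unserialize(fread($fh, $meta[$id]['length']));
     fclose($fh);
     return $row;
}

// Demo files in the system temp directory (hypothetical names)
$base = sys_get_temp_dir() . '/table_demo_' . uniqid();
$meta_file = $base . '.meta';
$data_file = $base . '.data';

table_insert($meta_file, $data_file, array('author' => 'Alice', 'content' => 'Hello'));
$id = table_insert($meta_file, $data_file, array('author' => 'Bob', 'content' => 'Nice site'));

$row = table_read($meta_file, $data_file, $id);
echo $row['author'];   // Bob

unlink($meta_file);
unlink($data_file);
```

The first file is still read in one go here, which is acceptable because it holds only two numbers per row rather than the row data itself.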

You could go further and store the exact positions of each of the fields within each row. However, this saves very little memory for the extra time it takes, and can make reading from and writing to the table very complex and slow.

You can also use files called indexes, which speed up WHERE searches by removing the need to perform a full table search. An index usually stores the data from one column, but sometimes from two or three. It keeps this data in a structure (look on Wikipedia for B-Trees) which is optimized to allow quick searches. Each entry in the index also holds the position of the matching record in the table file, so once a result has been found the record can be read straight off using a file seek. Indexes can also help with other parts of a SELECT query such as ORDER BY and LIMIT.
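
As a simplified illustration of the lookup, the sketch below uses a sorted list with a binary search in place of a B-Tree, which is much easier to show in a few lines. The function names and the sample row positions are invented for the example.

```php
<?php
// Build an index on one column: a list of array(value, row position)
// pairs sorted by value. A sorted list stands in for a B-Tree here.
function build_index(array $rows, $column){
     $index = array();
     foreach($rows as $position => $row){
          $index[] = array($row[$column], $position);
     }
     usort($index, function($a, $b){ return strcmp($a[0], $b[0]); });
     return $index;
}

// Binary search: returns the row position for $value, or false if absent.
// With duplicate values it returns one of the matching positions.
function index_find(array $index, $value){
     $low = 0;
     $high = count($index) - 1;
     while($low <= $high){
          $mid = (int)(($low + $high) / 2);
          $cmp = strcmp($index[$mid][0], $value);
          if($cmp === 0){
               return $index[$mid][1];
          }
          if($cmp < 0){
               $low = $mid + 1;
          } else {
               $high = $mid - 1;
          }
     }
     return false;
}

// Rows keyed by their (made-up) byte position in the table file
$rows = array(
     0   => array('author' => 'Carol'),
     64  => array('author' => 'Alice'),
     128 => array('author' => 'Bob')
);
$index = build_index($rows, 'author');
echo index_find($index, 'Bob');   // 128
```

Since a valid position can be 0, callers should compare the result against false with === rather than relying on a plain truth test.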

While indexes can speed up SELECT queries massively, each time the table is changed the indexes must be updated. For instance, a record might change position in the table data file, so the indexes would have to reflect this change. This means INSERT, UPDATE and DELETE statements (anything which changes the data) can be made slower by using indexes. However, if the indexes are only used on one or two columns and are kept small, the overhead is small compared with the speed gained on searches.
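
To give a feel for that extra work, here is a sketch of keeping a sorted index up to date when rows are added or removed. It assumes the index is a sorted list of array(value, position) pairs rather than a real B-Tree, and the function names are hypothetical:

```php
<?php
// Insert a (value, position) pair into a sorted index, keeping it sorted.
function index_insert(array $index, $value, $position){
     $low = 0;
     $high = count($index);
     while($low < $high){                // find the first entry > $value
          $mid = (int)(($low + $high) / 2);
          if(strcmp($index[$mid][0], $value) <= 0){
               $low = $mid + 1;
          } else {
               $high = $mid;
          }
     }
     array_splice($index, $low, 0, array(array($value, $position)));
     return $index;
}

// Remove the entry that points at $position, for example after a DELETE.
function index_remove(array $index, $position){
     foreach($index as $i => $entry){
          if($entry[1] === $position){
               array_splice($index, $i, 1);
               break;
          }
     }
     return $index;
}

$index = array(array('Alice', 64), array('Carol', 0));
$index = index_insert($index, 'Bob', 128);   // order: Alice, Bob, Carol
$index = index_remove($index, 64);           // Alice's row was deleted
echo $index[0][0];   // Bob
```

Every indexed column needs this kind of update on each write, which is why it pays to index only the columns that are actually searched on.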

While I’m not going to post FFD’s own code for this, I hope you understand what I’m describing here and how you can implement it. Of course, these ideas are platform and language independent, and so are not limited to the realm of PHP.