Data structures in python: maintaining filesystem structure within a database

Question

I have a data organization issue. I'm working on a client/server project where the server must maintain a copy of the client's filesystem structure inside of a database that resides on the server. The idea is to display the filesystem contents on the server side in an AJAX-ified web interface. Right now I'm simply uploading a list of files to the database where the files are dumped sequentially. The problem is how to recapture the filesystem structure on the server end once they're in the database. It doesn't seem feasible to reconstruct the parent->child structure on the server end by iterating through a huge list of files. However, when the file objects have no references to each other, that seems to be the only option.

I'm not entirely sure how to handle this. As near as I can tell, I would need to duplicate some type of filesystem data structure on the server side (in a Btree perhaps?) with objects maintaining pointers to their parents and/or children. I'm wondering if anyone has had any similar past experiences they could share, or maybe some helpful resources to point me in the right direction.

Would it not be feasible to zip the directory and unzip it on the server? Then the filesystem structure on the client would be transported to the filesystem structure on the server side. After all, a filesystem is a database of files.
– Lie Ryan, Commented Jul 19, 2012 at 6:10
Well, no, because I don't want to transfer the file data itself, just the filesystem structure. I need just the filesystem structure to be viewable in a tree-like layout from the server. I don't want the actual file data stored on the server.
– blindsnowmobile, Commented Jul 19, 2012 at 6:24

pepr · Accepted Answer · 2012-07-19 06:48:16Z

I suggest following the Unix way. Each file is considered a stream of bytes, nothing more, nothing less. Each file is technically represented by a single structure called an i-node (index node) that keeps all information related to the physical stream of data (including attributes, ownership, ...).

The i-node does not contain anything about the readable name. Each i-node is given a unique number (forever) that acts as the file's technical name. You can use a similar number to give the stream of bytes in the database its unique identification. The i-nodes are stored on the disk in a separate contiguous section: think of an array of i-node structures (in the abstract sense), or of a separate table in the database.

Back to the file: this way it is represented by a unique number. For your database representation, the number will be the unique key. If you need the other i-node information (file attributes), you can add more columns to the table. One column will be of the blob type, and it will represent the content of the file (the stream of bytes). For AJAX, I guess the files will be rather small, so you should not have a problem with the size limits of the blob.

So far, the files are stored as a flat structure (as the physical disk is, and as the relational database is).

The directory names and file names are kept separately, in other files (stored in the same structure, together with the other files, and likewise represented by their i-nodes). Basically, a directory file holds tuples of (bare_name, i-node number). (This is how hard links are implemented in Unix: two names are paired with the same i-node number.) The root directory file has to have a fixed technical identification, i.e. a reserved i-node number.
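The i-node idea above could be sketched with two tables, one for the i-nodes and one for the directory entries. This is only a minimal illustration using `sqlite3`: the table names `inode` and `dirent`, the column names, the `ROOT_INO` value, and the helpers `resolve` and `list_dir` are all made up for the example. The blob column for file content is left out, since the question only needs the structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inode (
        ino     INTEGER PRIMARY KEY,   -- the file's "technical name"
        is_dir  INTEGER NOT NULL,      -- 1 for directories, 0 for regular files
        size    INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE dirent (
        dir_ino INTEGER NOT NULL REFERENCES inode(ino),  -- containing directory
        name    TEXT    NOT NULL,                        -- bare name, no slashes
        ino     INTEGER NOT NULL REFERENCES inode(ino),  -- what the name points to
        PRIMARY KEY (dir_ino, name)
    );
""")

ROOT_INO = 1  # reserved number for the root directory (an arbitrary choice here)

conn.executemany("INSERT INTO inode (ino, is_dir, size) VALUES (?, ?, ?)", [
    (ROOT_INO, 1, 0),
    (2, 1, 0),
    (3, 0, 120),
])
conn.executemany("INSERT INTO dirent (dir_ino, name, ino) VALUES (?, ?, ?)", [
    (ROOT_INO, "home", 2),
    (2, "notes.txt", 3),
    (2, "notes-link.txt", 3),  # hard link: a second name for the same i-node
])


def resolve(conn, path):
    """Walk a slash-separated path from the root; return the i-node number or None."""
    ino = ROOT_INO
    for part in filter(None, path.split("/")):
        row = conn.execute(
            "SELECT ino FROM dirent WHERE dir_ino = ? AND name = ?", (ino, part)
        ).fetchone()
        if row is None:
            return None
        ino = row[0]
    return ino


def list_dir(conn, ino):
    """Return the sorted names stored in the directory with the given i-node."""
    rows = conn.execute(
        "SELECT name FROM dirent WHERE dir_ino = ? ORDER BY name", (ino,)
    )
    return [r[0] for r in rows]
```

With this layout, `resolve(conn, "/home/notes.txt")` and `resolve(conn, "/home/notes-link.txt")` both lead to the same row in `inode`, which is the hard-link behaviour described above.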

Sergey · Accepted Answer · 2012-07-19 06:55:51Z

If by "database" you mean an SQL database, then the magic words you're looking for are "self-referential tables" or, alternatively, "modified preorder tree traversal" (MPTT).

Basically, the first approach deals with "nodes" which have id, parent_id, name and kind attributes. So, to select the root-level directories you would do something like:

SELECT id, name FROM mytable WHERE parent_id IS NULL AND kind = 'directory';

which, let's assume, returns

[(1, "Documents and Settings"), (2, "Program Files"), (3, "Windows")]

then, to get directories inside "Documents and Settings" you issue another query:

SELECT id, name FROM mytable WHERE parent_id = 1 AND kind = 'directory';

and so on. Simple!
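As a rough sketch of how the flat list of uploaded paths might be turned into such a table in a single pass, the following uses `sqlite3` and a dictionary that remembers which path prefixes already have a row. The helpers `add_path` and `children`, the sample paths, and the assumption that every uploaded path names a file (so the last component is a file and everything before it is a directory) are mine, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mytable (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES mytable(id),  -- NULL for root-level entries
        name      TEXT NOT NULL,
        kind      TEXT NOT NULL                    -- 'directory' or 'file'
    )
""")


def add_path(conn, path, ids):
    """Insert every missing component of `path`; `ids` caches prefix -> row id."""
    parts = path.strip("/").split("/")
    parent_id = None
    for depth, name in enumerate(parts):
        key = "/".join(parts[: depth + 1])
        if key not in ids:
            # Assumption: the last component is a file, the rest are directories.
            kind = "file" if depth == len(parts) - 1 else "directory"
            cur = conn.execute(
                "INSERT INTO mytable (parent_id, name, kind) VALUES (?, ?, ?)",
                (parent_id, name, kind),
            )
            ids[key] = cur.lastrowid
        parent_id = ids[key]


def children(conn, parent_id):
    """Direct children of a node; pass None to get the root-level entries."""
    # SQLite's IS operator also matches NULL, so one query covers both cases.
    rows = conn.execute(
        "SELECT id, name, kind FROM mytable WHERE parent_id IS ? ORDER BY name",
        (parent_id,),
    )
    return rows.fetchall()


ids = {}
for p in [
    "Windows/system32/cmd.exe",
    "Windows/notepad.exe",
    "Program Files/app/run.exe",
]:
    add_path(conn, p, ids)
```

Each path is split once and each component is inserted at most once, so the cost grows with the total number of path components rather than requiring repeated scans of the whole list. A web interface could then call something like `children(conn, some_id)` each time a directory is expanded.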

MPTT is a little trickier, but you'll find a good tutorial, for example, on Wikipedia. This approach is very efficient for queries like "find all descendants of a given node" or "how many files are in this directory, including subdirectories", and is less efficient when the tree changes, as inserting or moving a node means renumbering many of the other nodes.
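To make the MPTT idea more concrete, here is a small sketch: every node gets a left and a right number from a depth-first walk, and a node's descendants are exactly the rows whose numbers fall between its own. The sample tree, the `node` table, and the helper names are invented for the illustration, and names are assumed unique only to keep the lookups short.

```python
import sqlite3

# Hypothetical sample tree: a dict is a directory, None marks a file.
TREE = {
    "root": {
        "docs": {"a.txt": None, "b.txt": None},
        "src": {"main.py": None},
        "README": None,
    }
}


def number_nodes(tree, counter=1, rows=None):
    """Depth-first walk that assigns (name, lft, rgt); returns (rows, next_counter)."""
    if rows is None:
        rows = []
    for name, sub in tree.items():
        lft = counter          # number handed out on the way down
        counter += 1
        if sub:
            _, counter = number_nodes(sub, counter, rows)
        rows.append((name, lft, counter))  # number handed out on the way back up
        counter += 1
    return rows, counter


rows, _ = number_nodes(TREE)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (name TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER)")
conn.executemany("INSERT INTO node (name, lft, rgt) VALUES (?, ?, ?)", rows)


def descendants(conn, name):
    """All nodes below `name`, at any depth, in one non-recursive query."""
    sql = """
        SELECT child.name
        FROM node AS parent
        JOIN node AS child
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.name = ?
        ORDER BY child.lft
    """
    return [r[0] for r in conn.execute(sql, (name,))]


def descendant_count(conn, name):
    """Each descendant uses two numbers between lft and rgt, hence the division."""
    row = conn.execute(
        "SELECT (rgt - lft - 1) / 2 FROM node WHERE name = ?", (name,)
    ).fetchone()
    return row[0]
```

The cost of this scheme shows up on writes: adding a file under `docs` would shift the `lft` and `rgt` values of every node numbered after it, which is the renumbering overhead mentioned above.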

Since you're using Python, you must be using an ORM; you're not going to build those queries manually, right? SQLAlchemy is capable of modelling self-referential relations, including "eagerly loading" the tree up to a certain depth with a single query.
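The idea of fetching the tree down to a fixed depth in one query can be illustrated without any ORM. The sketch below swaps in a different technique, a recursive common table expression in plain `sqlite3`, and is not how SQLAlchemy or Django implement their loading; the `load_tree` helper, the nested-dict result shape, and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT NOT NULL)"
)
conn.executemany("INSERT INTO mytable (id, parent_id, name) VALUES (?, ?, ?)", [
    (1, None, "Windows"),
    (2, 1, "system32"),
    (3, 2, "drivers"),
    (4, 3, "etc"),
    (5, None, "Program Files"),
])


def load_tree(conn, max_depth):
    """Fetch all nodes down to `max_depth` (roots are depth 0) with one query."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(id, parent_id, name, depth) AS (
            SELECT id, parent_id, name, 0 FROM mytable WHERE parent_id IS NULL
            UNION ALL
            SELECT t.id, t.parent_id, t.name, s.depth + 1
            FROM mytable AS t JOIN subtree AS s ON t.parent_id = s.id
            WHERE s.depth + 1 <= ?
        )
        SELECT id, parent_id, name FROM subtree ORDER BY depth, id
    """, (max_depth,)).fetchall()

    # Rows arrive ordered by depth, so a parent is always seen before its children.
    nodes, roots = {}, []
    for node_id, parent_id, name in rows:
        node = {"name": name, "children": []}
        nodes[node_id] = node
        if parent_id is None:
            roots.append(node)
        else:
            nodes[parent_id]["children"].append(node)
    return roots
```

Calling `load_tree(conn, 1)` would return the root directories with their immediate children attached and nothing deeper, which is roughly what a tree widget needs for its first render.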


1 Comment

blindsnowmobile Over a year ago

Yes, I'm using the Django ORM. Good information, thank you. This gives me some additional reading to do.
