iTAR	fast random access to compressed indexed TAR archives
	by Muayyad Saleh Alsadi <alsadi@gmail.com>
	Released under terms of GNU GPL
	Copyright © 2007, Arabeyes.org

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA

The archive format is compatible with the gzipped/bzipped TAR format
and can be extracted (or listed, etc.) with the usual tools, which are unaware of it,
but it also generates an index file, which makes it very fast to extract arbitrary files at arbitrary positions.

Currently the only supported compression is BZip2, as it is more efficient.
It only works on GNU/Linux (and maybe other UNIX systems), as it uses the GNU basename/alloca/bsearch/strn* functions and the POSIX mmap function family.
It is not difficult to port it, even to MS-Windows, as such functions could be replaced by workarounds (the difficult part, i.e. mmap, is already done: malloc/fread/fwrite alternatives are coded as suggested by the GNU coding standards).
But I'm not interested in that (I hate reinventing the wheel).

It's possible to generate the index if it is missing,
but this takes as long as extracting the
whole archive file (e.g. by detecting BZ_STREAM_END).
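
A minimal sketch of that index recovery in Python, not the actual iTAR code. It assumes the archive is a plain concatenation of BZip2 streams and that the index only needs the offset where each stream starts. The function name find_stream_offsets is made up for this example. Python's bz2 module exposes the end of a stream as the `eof` attribute, which plays the role of BZ_STREAM_END here.

```python
import bz2

def find_stream_offsets(path, bufsize=64 * 1024):
    """Return the file offset at which each BZip2 stream in `path` starts."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0          # file offset of the first byte held in `pending`
        pending = b""    # bytes read from the file but not yet decompressed
        while True:
            if not pending:
                pending = f.read(bufsize)
                if not pending:
                    break                      # clean end of archive
            offsets.append(pos)
            decomp = bz2.BZ2Decompressor()
            while not decomp.eof:
                if not pending:
                    pending = f.read(bufsize)
                    if not pending:
                        raise EOFError("truncated BZip2 stream")
                decomp.decompress(pending)     # output is discarded
                # unused_data is empty until the stream ends, then it holds
                # whatever followed the end-of-stream marker.
                pos += len(pending) - len(decomp.unused_data)
                pending = decomp.unused_data
    return offsets
```

Every byte still has to go through the decompressor, which is why rebuilding the index costs about as much as extracting the whole archive.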

You can use this library in free software.
You may also charge users money for it,
but you may NOT use it to develop proprietary software (so-called "commercial" software).
It is not released under the LGPL; it is released under the GPL.
See the file COPYING for details.

There is an included Python module called iTar (.py) which lets you extract member files, but it does not yet create iTar files.

The Python module uses a different method to extract member files than the current C code: it extracts the entire chunk that contains the member file and keeps it for future calls that extract other member files, while the C code does not use such a method. BZ2 may already do something like that internally, as each chunk is chosen to be (almost) a single BZ2 block. I think the method used in the iTar Python module is better, and it could be implemented in the C code soon (it allows faster reverse extraction).
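
The chunk-keeping method could look roughly like the sketch below. This is not the code of the included iTar module. The class name ChunkCache and its methods are hypothetical, and it assumes that the caller already knows (from the index) the offset and length of the compressed chunk, and that every chunk decompresses to a run of complete TAR members.

```python
import bz2
import io
import tarfile

class ChunkCache:
    """Keep the most recently decompressed chunk in memory (hypothetical helper)."""

    def __init__(self, archive_path):
        self.archive_path = archive_path
        self.cached_offset = None   # compressed offset of the cached chunk
        self.cached_tar = None      # decompressed TAR bytes of that chunk

    def _load(self, offset, length):
        # Decompress only when a different chunk is requested.
        if self.cached_offset != offset:
            with open(self.archive_path, "rb") as f:
                f.seek(offset)
                raw = f.read(length)
            self.cached_tar = bz2.decompress(raw)
            self.cached_offset = offset
        return self.cached_tar

    def extract(self, offset, length, member_name):
        """Return the contents of `member_name` stored in the given chunk."""
        data = self._load(offset, length)
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
            return tar.extractfile(member_name).read()
```

Extracting members of the same chunk in any order, including reverse order, then costs one decompression instead of one per call.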


Credits:
 the idea for extracting TAR files was taken from untgz.c (zlib/contrib/untgz/)
 which is not meant to be as powerful as GNU TAR
 untgz.c was written by "Pedro A. Aranda Gutiérrez" <paag@tid.es>
 adaptation to Unix by Jean-loup Gailly <jloup@gzip.org>
 various fixes by Cosmin Truta <cosmint@cs.ubbcluj.ro>

 some other ideas were taken from the GNU TAR info page.

WHERE DID THE IDEA COME FROM?

I needed a file format for my Thwab.net project,
an electronic encyclopedia system. At the size of a usual encyclopedia, a single XML file takes forever to parse and needs seemingly infinite memory. Using a database is the usual choice, but as the user is not supposed to change the encyclopedia, I thought of a fast random-access archive in which the hierarchical tree of chapters and sections is represented by hierarchical directories and the contents are regular files. But which archive format should be used?

TAR (a Unix/POSIX standard format) is not compressed at all and is designed to be accessed sequentially, so a compressed TAR would be much worse (to extract the last file you have to go through all previous files).

ZIP and the like compress every member file on its own, which degrades overall compression. If we needed such a thing, we could simply compress every single file before TARing them:

bash$ find ./ -type f -exec bzip2 '{}' ';'
bash$ tar -cvf ../file.tar ./

but this is not efficient at all, as it degrades compression.

So I thought of using a feature of GZip/BZip2 files: they can be concatenated. For example:

bash$ cat part1.gz part2.gz > whole.gz
bash$ gunzip whole
bash$ cmp orig whole

TAR files behave almost the same way. Not exactly, as a TAR file ends with an end-of-archive mark made of zero blocks,
but you can skip the blocks of some member files in a TAR file and the result will still be a valid TAR file.
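
This property can be checked with a short Python sketch using the standard tarfile module. The helper names (build_tar, drop_member, list_names) are made up for this example, and it assumes plain USTAR headers, so every member is one 512-byte header block followed by its data padded to whole 512-byte blocks.

```python
import io
import tarfile

BLOCK = 512  # TAR block size in bytes

def build_tar(files):
    """Build an uncompressed USTAR archive from (name, data) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def drop_member(tar_bytes, name):
    """Cut the header block and the data blocks of `name` out of the archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        info = tar.getmember(name)
    data_blocks = (info.size + BLOCK - 1) // BLOCK   # data is padded to whole blocks
    end = info.offset_data + data_blocks * BLOCK
    return tar_bytes[:info.offset] + tar_bytes[end:]

def list_names(tar_bytes):
    """List member names of an in-memory TAR archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        return tar.getnames()
```

For example, dropping "b" from an archive holding "a", "b" and "c" leaves an archive that still lists "a" and "c".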

My first idea was quite simple: sort the file tree (so that we know where every file is) and compress the directories (down to some depth) each into a single compressed TAR chunk.
For example, if the chunk level is 1, then the files '001/001/009/012' and '001/099/039/017' are compressed in the same chunk,
but '002/001/001/001' is in another chunk.
So if a file in the second chunk is read, we seek
through the archive to skip the first chunk and start decompression from the second chunk (which was created as a separate, concatenated compressed block).
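
The depth-based grouping can be sketched in Python as follows. This only illustrates the idea and is not the tar2ibz2 script itself. The function names and the `level` parameter are chosen for this example.

```python
from itertools import groupby

def chunk_key(path, level=1):
    """Return the leading `level` directory components of `path`."""
    return "/".join(path.split("/")[:level])

def group_into_chunks(paths, level=1):
    """Sort the member paths and group those that share the same chunk key."""
    ordered = sorted(paths)
    return [list(group)
            for _, group in groupby(ordered, key=lambda p: chunk_key(p, level))]
```

With level=1 the three example paths above end up in two groups, one for '001' and one for '002'. Each group would then be written as its own compressed TAR chunk.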

The more chunks there are, the faster the access time but the bigger the file size.

The chunk depth is set by trial and error to get good compression while keeping relatively good speed.

I wrote some scripts to test this idea, and it was good, but not good enough.

After some benchmarks on real data I thought of a better solution.
So this is not how iTAR does it; it uses a better algorithm.


HOW DOES IT REALLY WORK?

The idea is quite simple: let the compressor tell us when to create a chunk! I have used low-level BZ2 calls to know when it is efficient to create a new chunk. It is just as simple as that.

First we decide how fast we want it to be. Let's say that computers decompress 32KB of compressed data fast enough
and that we can afford to wait for that, and let's consider this the worst case of compression size degradation (i.e. no smaller chunks are allowed unless we are on the last chunk).

We could change this limit to any value, depending on the speed or size we want.

Second, while compressing, we let the compressor tell us how much independent compressed data it has generated so far in this chunk. If that is larger than the chunk size limit, we create a new chunk!
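
A rough Python sketch of this compressor-driven chunking follows. It is an approximation under stated assumptions, not the C implementation: it uses Python's bz2.BZ2Compressor instead of the low-level libbz2 calls, and it counts the bytes that compress() hands back (which only appear once the compressor finishes an internal block) as the "compressed data generated so far". The function name, the `limit` parameter and the index layout are hypothetical.

```python
import bz2

def compress_in_chunks(blocks, limit=32 * 1024, level=9):
    """Compress an iterable of byte blocks into concatenated BZip2 streams.

    Returns (archive_bytes, index), where index holds one
    (uncompressed_offset, compressed_offset) pair per chunk.
    """
    out = bytearray()
    index = []
    comp = None       # compressor of the chunk currently being written
    produced = 0      # compressed bytes emitted so far by the current chunk
    consumed = 0      # uncompressed bytes fed so far, over all chunks
    for block in blocks:
        if comp is None:
            # Start a new independent stream and record where it begins.
            index.append((consumed, len(out)))
            comp = bz2.BZ2Compressor(level)
            produced = 0
        piece = comp.compress(block)
        out += piece
        produced += len(piece)
        consumed += len(block)
        if produced >= limit:
            out += comp.flush()   # close the stream; the next block opens a new one
            comp = None
    if comp is not None:
        out += comp.flush()
    return bytes(out), index
```

In a real archive the blocks would be 512-byte TAR blocks, and a reader would seek to a chunk's compressed offset and start decompressing from there. Since every chunk is a complete BZip2 stream, unaware tools can still decompress the whole file.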

Benchmark:
The old plain tar file size is 45762560 bytes (43.64MB).
The indexed tar.bz2 file size using directory depth-based indexing (tar2ibz2) is 7864645 bytes (7.5MB),
	with an index size of 365 bytes (29 entries).
The indexed tar.bz2 file size using the new method (itar) is 7856991 bytes (7.49MB),
	with an index size of 739 bytes (43 entries).
Notice that the new method is very efficient: it extracts files faster because there are more index entries, and yet the file is smaller!

Access time of the old indexed tar.bz2 file:
time for i in `seq 10`; do itar lesaan.tar.bz2 '01/22/12' > /dev/null; done
real    0m1.961s
user    0m1.790s
sys     0m0.080s

Access time of the new indexed tar.bz2 file:
time for i in `seq 10`; do itar lesaan.tar.bz2 '01/22/12' > /dev/null; done
real    0m0.825s
user    0m0.689s
sys     0m0.072s

Conclusion:
The iTAR format is both smaller and faster than both file-based and depth-based chunking, because the place where a chunk is created is decided by the compressor state, not by indirect measures such as file or directory boundaries.

NEW:
Recently I introduced a new family of functions for extracting by file index, just like those found in the GNOME Structured File Library (GSF). One could extract a file by its index or by its name, although one could not get the name corresponding to a given index or vice versa.

Then I added a file list feature to the index file,
so that one can easily get the filename corresponding to a given index or vice versa (through a fast bsearch).
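
The lookup in both directions can be sketched in Python with the bisect module standing in for C's bsearch. This assumes the file list is kept sorted by name and that a file's index is simply its position in that list. The real index file layout may differ, and the class name FileList is hypothetical.

```python
from bisect import bisect_left

class FileList:
    """Sorted list of member names with index <-> name lookup (hypothetical)."""

    def __init__(self, names):
        self.names = sorted(names)

    def name_of(self, index):
        """Index to name: a direct array access."""
        return self.names[index]

    def index_of(self, name):
        """Name to index: a binary search over the sorted names."""
        i = bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            return i
        raise KeyError(name)
```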

BUGS:
There are probably some bugs when intermixing the extract-by-index and extract-by-filename families of functions.
This will be solved soon.

The KDE tool (ark), which is unaware of this format, cannot handle it due to a bug in KDE; GNU TAR, MC and the GNOME tools, which are equally unaware of it, work fine.
