python - compressing data with HDFStore
I'm a newbie to PyTables and have a question about storing a compressed pandas DataFrame. My current code is:
    import pandas

    # HDF5 file name
    H5name = "C:\\MyDir\\MyHDF.h5"

    # create the HDF5 file store
    store = pandas.io.pytables.HDFStore(H5name)

    # write a pandas DataFrame to the created HDF5 file
    myDF.to_hdf(H5name, "myDFname", append=True)

    # read the pandas DataFrame back from the HDF5 file
    myDF1 = pandas.io.pytables.read_hdf(H5name, "myDFname")

    # close the file store
    store.close()
When I checked the size of the HDF5 file, it (212 KB) was much larger than the original CSV file (58 KB), so I tried recreating the store with

    HDFStore(H5name, complevel=1)

and the file size did not change. I tried all complevel values from 1 to 9 and the size still remained the same.
I also tried creating the store with

    # create HDF5 file store
    pandas.io.pytables.HDFStore(H5name, complevel=1, complib="zlib")

but there was no change in compression.
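For reference, compression settings can also be passed directly to to_hdf, which forwards them to the underlying HDFStore; a minimal self-contained sketch (the frame, file name, and format="table" choice are illustrative assumptions, not the code from above):

    import pandas as pd

    # illustrative frame; complevel/complib are forwarded by to_hdf
    # to the underlying HDFStore
    df = pd.DataFrame({"x": range(10000)})
    df.to_hdf("compressed.h5", "df", format="table",
              complevel=9, complib="zlib")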
What could be the problem?
In addition, ideally I would like to use compression similar to what R's save function does (in my case the 58 KB file was saved as a 27 KB RData file). Do I need to do any additional serialization in Python to reduce the size?
Edit: I am using Python 3.3.3 and pandas 0.13.1.
Edit 2: I tried a much larger 487 MB CSV file, whose RData size (via R's save function) is 169 MB. For larger files I do see the compression: bzip2 gave the best compression at 202 MB (level=9) and was the slowest to read/write; blosc (level=9) gave the largest size at 276 MB, but was much faster to write/read.
Not sure what R does differently in its save function, but it is comparably fast and compresses much better than any of these compression algorithms.
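For reference, the size comparison above can be reproduced along these lines; a rough sketch, assuming the large CSV is loaded into a DataFrame first (file names are illustrative):

    import os
    import pandas as pd

    # assumed: the large CSV from the edit above
    df = pd.read_csv("large.csv")

    # write the same frame once per compression library at level 9
    # and compare the resulting file sizes
    for complib in ("zlib", "bzip2", "blosc"):
        path = "test_{0}.h5".format(complib)
        df.to_hdf(path, "df", complevel=9, complib=complib)
        print(complib, round(os.path.getsize(path) / 1e6, 1), "MB")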
If you actually have a smallish file, then HDF5 essentially chunks your data; normally 64 KB is the minimum chunk size. Depending on what the data is, it may not even compress at that size.
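One way to see whether compression was actually applied is to inspect the PyTables filters attached to each node in the file; a diagnostic sketch, assuming PyTables 3.x naming and the file path from the question:

    import tables

    # print the compression filters recorded on every leaf node
    with tables.open_file("C:\\MyDir\\MyHDF.h5", mode="r") as f:
        for leaf in f.walk_nodes("/", "Leaf"):
            print(leaf._v_pathname, leaf.filters)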
You can try msgpack for a simple solution for data of this size. HDF5 is quite efficient for larger data sizes and will compress quite well.
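A minimal sketch of the msgpack route (to_msgpack/read_msgpack were experimental around pandas 0.13 and have been removed from recent versions, so this assumes an older pandas; the compress option and file name are illustrative):

    import pandas as pd

    df = pd.DataFrame({"a": range(1000)})

    # serialize with zlib compression, then read it back
    df.to_msgpack("myDF.msg", compress="zlib")
    df2 = pd.read_msgpack("myDF.msg")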