Sunday, September 27, 2009

how to shuffle data in out-of-core manner?

When the data is very huge and we cannot put all the data in the main
memory, the current shuffle methods that assume all the data resides in memory could no longer be used directly.

Here we use a method similar to the way in shuffling the cards ( similar to mergesort):
while( iteration < setvalue)
{
tmpfile_set = split(datafile)
datafile = merge(tmpfile_set)
delete(tmpfile_set)
iteration <= iteration + 1
}

Another way to do this is to use fopen64 function. This function is used to
open large files that could not be loaded into the memory at once. We can find
a way to shuffle the data as follows:

1. find a permutation of the [1...N], where N is the total number of records.
Suppose the permutation is [n1, n2, n3,..., nN]
2. try to put the ith record in the original file to the ni th place in the new file.

In this method, we would need to consider the following issues:

a. the fseek function is needed in step 2. If missing values exist, then the starting point of a record is difficult to find.

b. the running time of this algorithm seems a problem coz fseek may cross several blocks? I do not know how to analysis the time now.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.