File Chunking and Merging in C++

I got the idea for a file chunking program the other day while I was reading a Ruby article. I thought to myself, why not build a similar utility in C++? Chunking a file (that is, taking a file and breaking it into smaller files) can be useful when dealing with some rather big files. Perhaps if you mix in some file compression before the chunking, it would be even better! The need for a utility like this has diminished over recent years because of high bandwidth and larger storage mediums. However, some programs (like email) may still benefit from something like this. I will show you a simple example of breaking a file apart and then merging it back into the original file… all right here on the Programming Underground!

The idea behind the chunking itself is not a complex one. We need to locate a file, open it, read a set number of bytes out of it, create a new file for each piece we read, and write that piece into it. We then close that chunk file, open another, and read more of the input file. We repeat the process until we reach the end of the original file.

Our chunking function is going to take three pieces of information: the full path to the file we want to chunk, the name we want to give each chunk (the chunk number is then appended to it), and the chunk size, which the user can specify. So for instance, if the user takes “c:\input.txt”, which is 330 bytes, and tells the function to chunk it in segments of 50 bytes named “chunk”, it is going to produce 7 files. Six of those files will be 50 bytes and the last will be the remaining 30 bytes. They will be named “chunk.1”, “chunk.2”, “chunk.3” etc. I am sure you get the idea. 😉
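Just to make the math concrete, here is a minimal sketch of the chunk-count arithmetic (the values are the ones from the example above, not anything the functions below compute for you):

#include <iostream>

int main() {
	// Example values from the text above
	unsigned long fileSize = 330;	// total size of input.txt in bytes
	unsigned long chunkSize = 50;	// bytes per chunk

	// Integer ceiling division: full chunks plus one for any remainder
	unsigned long numChunks = (fileSize + chunkSize - 1) / chunkSize;
	unsigned long remainder = fileSize % chunkSize;	// 0 means the last chunk is full

	// Prints: 7 chunks, last chunk is 30 bytes
	std::cout << numChunks << " chunks, last chunk is "
		<< (remainder ? remainder : chunkSize) << " bytes" << std::endl;
	return 0;
}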

We open the file input.txt and, if everything is successful, we read out 50 bytes at a time and place them in new files. We continue the process until we reach the end of the input file. The code looks like this…

 
// Chunks a file by breaking it up into chunks of "chunkSize" bytes.
void chunkFile(const char *fullFilePath, const char *chunkName, unsigned long chunkSize) {
	ifstream fileStream;
	fileStream.open(fullFilePath, ios::in | ios::binary);

	// File opened successfully
	if (fileStream.is_open()) {
		ofstream output;
		int counter = 1;

		string fullChunkName;

		// Create a buffer to hold each chunk
		char *buffer = new char[chunkSize];

		// Keep reading until we hit the end of the file. Reading first and
		// then checking gcount() avoids creating an empty last chunk when
		// the file size is an exact multiple of chunkSize.
		while (fileStream.read(buffer, chunkSize) || fileStream.gcount() > 0) {

			// Build the chunk file name. Usually drive:\\chunkName.ext.N
			// N represents the Nth chunk
			fullChunkName.clear();
			fullChunkName.append(chunkName);
			fullChunkName.append(".");

			// Convert counter integer into string and append to name.
			fullChunkName.append(to_string(counter));

			// Open new chunk file name for output
			output.open(fullChunkName.c_str(),ios::out | ios::trunc | ios::binary);

			// If the chunk file opened successfully, write the bytes we
			// just read into it. Then close.
			if (output.is_open()) {
				// gcount() returns the number of bytes the last read() pulled in.
				output.write(buffer, fileStream.gcount());
				output.close();

				counter++;
			}
		}

		// Cleanup buffer (allocated with new[], so release with delete[])
		delete[] buffer;

		// Close input file stream.
		fileStream.close();
		cout << "Chunking complete! " << counter - 1 << " files created." << endl;
	}
	else { cout << "Error opening file!" << endl; }
}

As you can see, it is rather long and could probably be refactored a little, but stepping through it is not incredibly difficult. We open the input file, set up a buffer array which will hold each chunk read, and then use that buffer to write each piece to its chunk file. The file name for each chunk is built using the string class (so make sure to include <string>; the to_string() calls also require C++11 or later), and then the chunk is written into it and closed before more input is read. The result of this function will be a list of files ending in numbers.
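If you want to give it a quick spin, a minimal sketch of a caller might look like this (the path and chunk size are just the example values from earlier; point it at a real file on your machine):

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

// Defined above
void chunkFile(const char *fullFilePath, const char *chunkName, unsigned long chunkSize);

int main() {
	// Break input.txt into 50-byte chunks named chunk.1, chunk.2, ...
	chunkFile("c:\\input.txt", "chunk", 50);
	return 0;
}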

Now once we have the chunks created, we will need a function that takes each chunk file, opens it, reads it, and puts it all back into one file again.

As you can see, we are opening each file in binary mode. This is so that we can read not only text files but literally any file (jpgs, exe files, dlls, etc.) without the stream translating or otherwise messing with the bytes, which text mode can do (line-ending translation, for example).

The function that merges the files back together we call “joinFile”. I guess I could have called it “mergeFile” or “glueFile” or something funny like that. This function also has a little helper function called “getFileSize” which simply gets the size of each chunk file so that joinFile knows how big to make the buffer for reading in the chunk. It goes a little bit like this…

// Prototype for the helper function defined below.
int getFileSize(ifstream *file);

// Finds chunks by "chunkName" and creates the file specified in fileOutput
void joinFile(const char *chunkName, const char *fileOutput) {
	string fileName;

	// Create our output file
	ofstream outputfile;
	outputfile.open(fileOutput, ios::out | ios::binary);

	// If successful, loop through chunks matching chunkName
	if (outputfile.is_open()) {
		bool filefound = true;
		int counter = 1;
		int fileSize = 0;

		while (filefound) {
			filefound = false;

			// Build the filename
			fileName.clear();
			fileName.append(chunkName);
			fileName.append(".");

			// Convert counter integer into string and append to name.
			fileName.append(to_string(counter));

			// Open chunk to read
			ifstream fileInput;
			fileInput.open(fileName.c_str(), ios::in | ios::binary);

			// If chunk opened successfully, read it and write it to 
			// output file.
			if (fileInput.is_open()) {
				filefound = true;
				fileSize = getFileSize(&fileInput);
				char *inputBuffer = new char[fileSize];

				fileInput.read(inputBuffer,fileSize);
				outputfile.write(inputBuffer,fileSize);
				// Allocated with new[], so release with delete[]
				delete[] inputBuffer;

				fileInput.close();
			}
			counter++;
		}

		// Close output file.
		outputfile.close();

		cout << "File assembly complete!" << endl;
	}
	else { cout << "Error: Unable to open file for output." << endl; }

}

// Simply gets the size of a file in bytes.
int getFileSize(ifstream *file) {
	// Seek to the end, note the position, then rewind to the beginning.
	file->seekg(0, ios::end);
	int filesize = (int)file->tellg();
	file->seekg(0, ios::beg);
	return filesize;
}

We give it the chunk name we used when breaking the file apart and then the filename we want the chunks merged into. We could improve this code a little: instead of attempting to read a buffer of “fileSize” bytes in one shot, we could work with a smaller fixed-size buffer and keep reading until we reach the end of the chunk. You may want to think about such a change if you expect the chunks to be extremely large (tens of megs or higher) and don’t want to burn up a lot of memory. A sketch of that variation follows.
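Here is a rough sketch of what that could look like inside the if (fileInput.is_open()) block of joinFile, assuming a hypothetical 4 KB working buffer (this replaces the getFileSize() and inputBuffer lines entirely):

			// A fixed-size buffer keeps memory use constant no matter
			// how large each chunk file happens to be.
			const size_t BUF_SIZE = 4096;	// hypothetical 4 KB buffer
			char smallBuffer[BUF_SIZE];

			// Read the chunk in BUF_SIZE pieces until it is exhausted.
			while (fileInput.read(smallBuffer, BUF_SIZE) || fileInput.gcount() > 0) {
				outputfile.write(smallBuffer, fileInput.gcount());
			}

With this approach the getFileSize helper is no longer needed at all, since we never have to size a buffer to the whole chunk up front.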

For this utility I would suggest not trying to create chunk files in the multiple-gigabyte range. A few megs each would probably do nicely for chunking a file that is a couple of gigs. Keep the trade-off in mind though: a 2 GB file in 5 MB chunks already works out to around 400 files, so the balance is to create decent-sized chunks which you could transport without having to juggle hundreds of them.

This utility would have been perfect back in the day when we used diskettes that could only hold 1.44 megs each. I believe DOS had a similar utility, but I figured this would be a nice little program you could play with and drop into your own projects if you needed one.

I hope you enjoy it. As with all code on the Programming Underground, this is in the public domain and free to use as you see fit. Enjoy the code tidbits and thanks for reading! 🙂

About The Author

Martyr2 is the founder of the Coders Lexicon and author of the new ebooks "The Programmers Idea Book" and "Diagnosing the Problem". He has been a programmer for over 25 years. He works for a hot application development company in Vancouver, Canada, which services some of the biggest tech companies in the world. He has won numerous awards for his mentoring in software development and contributes regularly to several communities around the web. He is an expert in numerous languages including .NET, PHP, C/C++, Java and more.