Count Word Matches in a File With Our Friend Java

Ever wanted to dig through a text file and see how many times a particular word appears? Are you afraid the file is going to jump up and bite you, and bite you hard?!? Don’t sweat it. That crazy file isn’t going to be much trouble if you know a few useful classes for dealing with the situation. In this entry we talk about opening a file using a BufferedReader, tokenizing it into words and comparing those to a word provided on the command line when the program is run. By no means is this the only way of doing it, but it certainly has some flexibility to it. As with most of my examples, I try to leave it in a state where you could easily add on to it and grow it into something much more specialized. That is what we do here, help others on the Programming Underground!

The code below is run from the command line in the following way: java countwords <file path> <word to find>. So, for instance, to search the test.txt file on the c:\ drive for the word “hello” we would use the line java countwords "c:\test.txt" "hello".

The process is pretty straightforward. First we need to be able to access the file. That means we need to import the java.io package so that we can use the file classes BufferedReader, FileReader, and File. Then, to help us break up the text that we are going to read into the program, we need the StringTokenizer class from java.util. The job of the StringTokenizer class is to take in a line of text and break it into tokens. By default this class breaks a string on whitespace (spaces, tabs, newlines, carriage returns and form feeds), which gives us word tokens. However, the class can also take a string of delimiters (characters used for splitting a string) and break the string on any boundaries you like. If you are unfamiliar with StringTokenizer you can read up on it in the Java API documentation.
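To see what the tokenizer actually hands back, here is a small sketch that is separate from the main program. The class name TokenizerDemo and the tokens() helper are made up for illustration; only StringTokenizer itself comes from the Java library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, for illustration only.
class TokenizerDemo {

    // Collect every token for the given line.
    // Passing null for delimiters falls back to StringTokenizer's default whitespace set.
    static List<String> tokens(String line, String delimiters) {
        StringTokenizer st = (delimiters == null)
                ? new StringTokenizer(line)
                : new StringTokenizer(line, delimiters);

        List<String> result = new ArrayList<>();
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Default delimiters: spaces and tabs both split the line.
        System.out.println(tokens("hello world\thello", null)); // [hello, world, hello]

        // Custom delimiters: split on commas and semicolons instead.
        System.out.println(tokens("a,b;c", ",;")); // [a, b, c]
    }
}
```

Notice that the delimiters themselves are thrown away, so you only ever get the pieces between them.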

Once we have those packages in place we can begin our program. It is seen below…

// Packages java.io to read files and the StringTokenizer class to tokenize each line read from the file.
import java.io.*;
import java.util.StringTokenizer;

public class countwords {
	public static void main(String args[]) {
	
		// Make sure that they have provided at LEAST two parameters. First being a file and the other being the word to find.
		if (args.length > 1) {
			// Check if the first parameter is a file.
			File searchfile = new File(args[0]);
			if (!searchfile.exists()) { System.out.println("Sorry, unable to find that file"); System.exit(1);}

			try {
				// Set up our reader to read the file
				BufferedReader r = new BufferedReader(new FileReader(args[0]));
				
				String text = "";
				
				// Setup our counter to count word matches
				int counter = 0;
				
				// Loop through each line of the file and tokenize it. 
				while ((text = r.readLine()) != null) {
					StringTokenizer st = new StringTokenizer(text);
					
					// Loop through the tokens and see if they match the word specified by the user.
					while (st.hasMoreTokens()) {
						if (st.nextToken().equals(args[1])) {
							counter++;
						}
					}
				}
				
				r.close();
				
				// Let the user know how many matches were found
				System.out.println("Found the word: \"" + args[1] + "\" " + counter + " times.");
			}
			catch (IOException e) {
				// Print any error messages we happen to get related to opening and using the file.
				System.out.println(e.getMessage());
			}
		}
		else { 
			System.out.println("Please provide at least two arguments. The first is the file path and second is the word to find.");
		}
	}
}

If you follow along we first start by testing the incoming arguments. We want to test to make sure we have all of them and that they are legitimate. The incoming parameters are what are put into the args[] array in main (in case you were ever wondering what that args[] was for). So we can test the length of this array to know how many arguments were actually passed in. Note: Here is a great place to add extra code if you wanted to actually search for more than one word at a time and return the results of multiple matches.
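If you wanted to take up that note and search for several words in one pass, one possible sketch looks like the following. It assumes every argument after the file path is a word to look for. The class name MultiWordCount and the countAll() method are invented for this example and are not part of the original program.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical variation on countwords that tracks several words at once.
class MultiWordCount {

    // Count exact (case-sensitive) matches for each search word.
    static Map<String, Integer> countAll(Reader source, String... words) throws IOException {
        // LinkedHashMap keeps the words in the order the user typed them.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            counts.put(word, 0);
        }

        BufferedReader r = new BufferedReader(source);
        String text;
        while ((text = r.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(text);
            while (st.hasMoreTokens()) {
                String token = st.nextToken();
                if (counts.containsKey(token)) {
                    counts.put(token, counts.get(token) + 1);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Usage: java MultiWordCount <file path> <word> [more words...]");
            return;
        }
        // Everything after the file path is treated as a word to find.
        String[] words = Arrays.copyOfRange(args, 1, args.length);
        try (Reader in = new FileReader(args[0])) {
            countAll(in, words).forEach((word, n) ->
                    System.out.println("Found the word: \"" + word + "\" " + n + " times."));
        }
    }
}
```

Because countAll() takes any Reader, you can also feed it a StringReader to try it out without a file on disk.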

Once we determine that we have at least two arguments, we test the first to see if indeed it is a file and that the program can locate it. We use the File class to do this because it has a real handy method called “exists()”. If the file is not found, we print the message to the user and abort the program. Feel free to call the user a dumbass if you like. Otherwise we go on to opening the file.
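One thing to keep in mind is that exists() also returns true for directories. If you want a slightly stricter check, a sketch along these lines could work. The class FileCheck and the method problemWith() are made-up names, and the extra messages are my own wording.

```java
import java.io.File;

// Hypothetical helper, for illustration only.
class FileCheck {

    // Returns null when the path looks usable, otherwise a message saying what is wrong.
    static String problemWith(File f) {
        if (!f.exists()) {
            return "Sorry, unable to find that file";
        }
        if (!f.isFile()) {
            // exists() is true for directories too, so rule those out.
            return "That path is not a regular file";
        }
        if (!f.canRead()) {
            return "That file cannot be read";
        }
        return null;
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("Usage: java FileCheck <file path>");
            return;
        }
        String problem = problemWith(new File(args[0]));
        System.out.println(problem == null ? "Looks good" : problem);
    }
}
```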

We open the file as a BufferedReader. The BufferedReader class can wrap any kind of “Reader” object. One of the useful ones is FileReader, because it allows us to simply supply the path to the file and we are set to go. We could use FileReader on its own, but it only offers low-level read() methods that hand back single characters or arrays of characters; it has no method for reading a whole line. It is primarily used to set up the needed stream to the file and give it to one of the many file handling objects. This is why we use a BufferedReader here: it buffers the input and gives us readLine() for actually READING the file a line at a time. We put all these file handling calls in a try/catch statement to make sure that we catch any errors that may appear from handling the file (like not being able to open it due to file permissions etc).
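Here is a minimal sketch of that reader setup on its own. It uses try-with-resources (Java 7 and later), which is a swap from the explicit r.close() in the program above, so the reader gets closed even when an exception is thrown. The class ReaderDemo and the method readLines() are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class ReaderDemo {

    // BufferedReader can wrap any Reader, so this works for files and in-memory text alike.
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader for us, even if readLine() throws.
        try (BufferedReader r = new BufferedReader(source)) {
            String text;
            while ((text = r.readLine()) != null) {
                lines.add(text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        try {
            // Use the file named on the command line, or fall back to some sample text.
            Reader source = (args.length > 0)
                    ? new FileReader(args[0])
                    : new StringReader("one\ntwo\nthree");
            System.out.println(readLines(source).size() + " lines read");
        } catch (IOException e) {
            // Same idea as the main program: report problems opening or reading the file.
            System.out.println(e.getMessage());
        }
    }
}
```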

After we have opened the file, we set up a text variable to hold the line we will read and a counter variable to count our matches. We then go into a while loop which is in charge of reading each line of the file and storing it in our text variable. It will continue until it reaches the end of the file, at which point readLine() returns null. On each iteration of the while, we pass the line we just read to our StringTokenizer. It breaks up the text into tokens that we can step through using the hasMoreTokens() and nextToken() methods. In this case each call to nextToken() hands us the next word of the line, and we compare it to the word provided by the user at the start of the program.
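One thing to be aware of is that the comparison uses equals(), so it is exact: “Hello” and “hello,” would not count as matches for “hello”. If you want looser matching, a sketch like the one below might do. It assumes you are happy treating common punctuation as delimiters and ignoring case, which is a change from how the original program behaves. LooseMatch, DELIMITERS and countInLine() are made-up names.

```java
import java.util.StringTokenizer;

// Hypothetical variation on the inner loop, for illustration only.
class LooseMatch {

    // Whitespace plus some common punctuation. Adjust this to suit your text.
    static final String DELIMITERS = " \t\n\r\f.,;:!?\"()";

    // Count matches in a single line, ignoring case and surrounding punctuation.
    static int countInLine(String line, String word) {
        StringTokenizer st = new StringTokenizer(line, DELIMITERS);
        int counter = 0;
        while (st.hasMoreTokens()) {
            if (st.nextToken().equalsIgnoreCase(word)) {
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(countInLine("Hello, hello! HELLO? help", "hello")); // 3
    }
}
```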

On each match of the word, we increment our counter variable. At the end of the inner loop we go back to reading in another line and the process starts all over again. It keeps going until we hit the end of the file and there is nothing left to read. Because we read and tokenize every line, really long text files will take a bit of time to process. You may want to put in some kind of status indicator to let the user know how far along the program is, just so they don’t think that it froze up. Once we have all the matches, we simply report that to the user in a nice message (or a cruel one, it is up to you) and end the program.
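A very simple status indicator could just print a note every so many lines. The sketch below is one way to do it, assuming a line count is a good enough measure of progress. The class ProgressCount, the count() method and the reportEvery parameter are invented for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.Reader;
import java.io.StringReader;
import java.util.StringTokenizer;

// Hypothetical variation on countwords with a basic progress message.
class ProgressCount {

    // Counts exact matches of word and reports to status every reportEvery lines.
    static int count(Reader source, String word, int reportEvery, PrintStream status)
            throws IOException {
        int counter = 0;
        long linesRead = 0;
        try (BufferedReader r = new BufferedReader(source)) {
            String text;
            while ((text = r.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(text);
                while (st.hasMoreTokens()) {
                    if (st.nextToken().equals(word)) {
                        counter++;
                    }
                }
                linesRead++;
                if (linesRead % reportEvery == 0) {
                    status.println("Processed " + linesRead + " lines so far...");
                }
            }
        }
        return counter;
    }

    public static void main(String[] args) throws IOException {
        Reader sample = new StringReader("hello world\nhello again\ngoodbye\nhello");
        // Status goes to System.err so it does not mix with the final result.
        int found = count(sample, "hello", 2, System.err);
        System.out.println("Found " + found + " matches."); // Found 3 matches.
    }
}
```

For a real file you would probably pick a much bigger reportEvery value, something like every 100,000 lines, so the messages don’t flood the screen.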

This program is not all that complex if you just follow through the code. It looks a bit longer than it really is, just due to the try/catch and some of the parameter checking. But hopefully you take a few things away from this example… 1) How to check for and validate some arguments passed into the program. 2) How to open and use a file using a BufferedReader/FileReader setup. 3) How to use the StringTokenizer class to tokenize a string and get at each individual token. 4) Lastly, how to use a while loop to read through a file until it reaches the end, all along comparing tokens to the word to find.

Hopefully this program will help with those school assignments out there that require you to access a file and either search for content or break up text to manipulate it. You can essentially take this program and rip out the while loops and you have a test frame for accessing files to run whatever experiment you want on them.

My code here is in the public domain and free for anyone to use or modify as they see fit, just like all the code on my blog. Having said that, it may be a great idea to subscribe to this blog so you get the latest code I have written and can share it with your friends, classmates, or mortal enemies (as you cause their computers to blurt out farting noises in class). Thanks for reading my blog and I hope to write again for you soon. 🙂

About The Author

Martyr2 is the founder of the Coders Lexicon and author of the new ebooks "The Programmers Idea Book" and "Diagnosing the Problem". He has been a programmer for over 25 years. He works for a hot application development company in Vancouver, Canada, which services some of the biggest tech companies in the world. He has won numerous awards for his mentoring in software development and contributes regularly to several communities around the web. He is an expert in numerous languages including .NET, PHP, C/C++, Java and more.