Intro to A Regular Expression, a CSV Story

Regular expressions… sounds like a complicated term doesn’t it? You better believe that when I drop the term at an office dinner party, others either immediately tune out or start saying “Oh hey so and so, good to see you” and then walk off pretending to say hi. Little do they know that the idea behind regular expressions is pretty simple and can actually make their lives easier. But hey, that is what you, as the programmer, are hired to do right? Protect them from themselves and make their work easier. The oil in the company’s machine. One of the tasks they may have you do is work with CSV files. CSV files are comma separated files much like you would see working with Excel. I have had many people ask questions about Excel, so I will give a nice demonstration of using regular expressions to validate a CSV for the correct number of fields. At the end it will show which records may be missing data resulting in a poor or failed import. Get ready for an action packed entry with thrills and spills and maybe someone running around on fire… right here on the Programming Underground!

For those of you who are new to the world of CSV files, they are usually text files which have fields separated by commas (aka comma separated or delimited files). These types of files are great for importing and exporting data between databases or spreadsheet applications like Microsoft Excel. You may have heard of other file types that fit into this group like tab delimited or space delimited files. If you are in the need of transferring files between something like SQL Server and MySQL, you could look into exporting record data as a CSV file. Each line of the file will then contain all the data for each record, separated by a comma. Here is an example…

"""The Hammer""","Johnson","Taco",3,172

Such a record may represent a wrestler named Taco “The Hammer” Johnson. He might have three legs (yeah he is a tripod, but don’t make fun of him for that) and is 172 cm tall. Notice how some of the fields are enclosed in quotes and others are not. Quoted material often represents string data like words or sentences while the number data represents… well… numbers.

We will use this record to show how we can apply a regular expression to validate it for the correct number of fields (5) and if the record doesn’t have that many fields, we need to know because that line may be corrupt and result in bad import. That isn’t to say that the field isn’t just empty, but at least this will detect empty fields as fields and will not flag it as a problem.

Now regular expressions are used for pattern matching. They are first setup with a pattern which it then tries to find in the data given. It may match zero times, one time or more than once. If it matches one time or more, it often creates what is called a “matches” array or collection depending on the language. This blog entry will not go into depth on all the pieces of a pattern, but it will explain to you how to construct one pattern to solve our current CSV problem.

Each language out there has a slightly different syntax on implementing regular expressions. .NET may implement a regular expression through an actual Regex object while javascript tends to use the pattern in functions, like replace(), which is a method of a string value. Either way, once you understand regular expressions the little syntax variations between languages are small enough to quickly adapt to and get the job done.

Our example below is going to be written in C#, but for a change of pace, I am also going to show you the difference in VBScript. So lets dive right in!

using System;
using System.IO;

// Notice the namespace for using regular expressions. .NET has an entire namespace dedicated to the topic.
using System.Text.RegularExpressions;


namespace experimentalconsole
{
	class Program
	{
		static void Main(string[] args)
		{
			// Create an instance of our class and run it
			checkCSV csv = new checkCSV();
			csv.Run("c:\\test.csv");
		}
		
	}

	class checkCSV
	{
		// Lets setup our pattern first
		private static string csvRegexPattern = @"([^,""]+|""([^""]|"""")*""|,,)";

		// Lets now setup a Regex object using the pattern
		// This creates a Regex object using the pattern passed as a parameter to the constructor
		private static Regex _Regex = new Regex(csvRegexPattern);

		// A run method for our console app, passing it a filename to parse.
		public void Run(string AFile)
		{
			String lineRead;
			int lineNumber = 0;

			try
			{
				StreamReader sr = new StreamReader(AFile);

				// Loop through the file, reading each line
				while (null != (lineRead = sr.ReadLine()))
				{
					// Increment the line number
					lineNumber++;

					if (_Regex.Matches(lineRead).Count != 5)
					{
						Console.WriteLine("Possible bad record '{0}' at line {1}", lineRead, lineNumber);
					}
				}
			}
			catch(Exception e)
			{
				// If there was an error, lets see what it was.
				Console.WriteLine("File could not be found: {0}",e.Message.ToString());
			}
		}
	}
}

This code essentially moves through the CSV one record at a time, and using our regular expression, it matches the fields one by one. This results in a “count” at the end by which we can compare. Each record here is suppose to have at least 5 fields. They may be empty, but if there are at least 5 fields the record is considered good for importing into a table with five columns. If we use the following data…

"""The Hammer""","Johnson","Taco",3,172
"""The Sham""","Rodrigez","Pablo",171
"""The Wicked""","Martinez","Tino",,180

It will flag line two (Pablo Rodrigez) as an error because it is missing a field (the number of legs field). Even though Tino doesn’t have the number of legs set (it is an empty field) it is still seen as a good record because our table allows an empty field to be there.

Our while loop reads in the line from the CSV file and uses the Regex object (powered with our pattern) to find out how many matches were found. This number is the result of how many fields were matched. So we are looking for five in this case and, if it is not matching, we spit out an error line saying that we have a possible problem suspect at line “lineNumber”. We also spit out the line onto screen so we can see if indeed there is a problem.

The pattern is dissected like this… ” a double quote, followed by… ( a group that contains… [^,””] any character that isn’t a double quote… | or… “” an escaped double quote, followed by… ) the end of the group… * where the group is found any number of times ending in… ” a double quote. The trick to regular expressions is just to break them down into little pieces and read each one in order. Once you read it like this, you will find it makes a lot of sense. ([i]Use the reference link at the bottom of this blog entry to get the idea of all the pieces that go into a pattern.[/i])

As promised I have also included a VBScript version of this so you could see how regular expressions work in a scripting language.

Dim fso, sstream, reg, lineRead, lineNumber

' Create a fileSystemObject and open the text file to the CSV (passed as an argument to the script)
Set fso = CreateObject("Scripting.FileSystemObject")
Set sstream = fso.OpenTextFile(WScript.Arguments.Item(0),1,True)

' Create a regular expression object
Set reg = New RegExp

' Setup our pattern
Dim csvRegexPattern 
csvRegexPattern = "([^,""]+|""([^""]|"""")*"")"

reg.Pattern = "^" + csvRegexPattern + "," + csvRegexPattern + "," + csvRegexPattern + "," + csvRegexPattern + "," + csvRegexPattern + "$"
lineNumber = 0

Do while Not sstream.AtEndOfStream
	 lineRead = sstream.ReadLine()
	 lineNumber = lineNumber + 1
	 If reg.Test(lineRead) then
		  WScript.Echo "Found match: '" & lineRead & "' at line " & lineNumber
	 End if
Loop

sstream.Close

As you can see from the VBscript example we are setting up our usual pattern, but you have to specify it for the number of fields we are checking for. This dollar sign on the end basically makes it so that the pattern matches EXACTLY the line read in because we don’t have the idea of a matches array that we can check for a count. This is one of those subtle differences between the languages for a regular expression.

We then loop through the file as we did in the C# console example and echo any problems we have. You could keep this code in a vbs script file and run it against any excel csv documents you have and change the pattern to match the column count etc.

So there you have it, a minor minor introduction to a regular expression and one application you could use with it. If you want more information about using regular expressions and a place to try it out, you can try the following links…

Regular Expressions Reference at regular-expressions.info – Basic Syntax

Regular Expression Library Online Tester

Thanks again for reading and next time we will use tell you how to use the term of regular expressions to instantly crash a party! It should be lots of fun! But until then, enjoy!

🙂

About The Author

Martyr2 is the founder of the Coders Lexicon and author of the new ebooks “The Programmers Idea Book” and “Diagnosing the Problem” . He has been a programmer for over 18 years. He works for a hot application development company in Vancouver Canada which service some of the biggest telecoms in the world. He has won numerous awards for his mentoring in software development and contributes regularly to several communities around the web. He is an expert in numerous languages including .NET, PHP, C/C++, Java and more.