A Splash of Regular Expressions

Several times each year I find myself in need of regular expressions. And each time, I crack my knuckles, and start googling for a quickstart guide.

Today was one such day. A colleague sent me a comma-delimited text file of values, and I needed to weed out some duplicates, and reformat the remainder as XML. I opened the text file in Notepad++ and went to work.

A quick replace of "," with ",\n" yielded about 6,500 lines. A manual edit was going to take way too long. Regular expressions were going to be my friend.
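For anyone who would rather script that first step than do it in an editor, here is a minimal Python sketch. The sample values are made up for illustration (they only mimic the shape of the real IDs), and in Notepad++ itself the "\n" in the replacement is only interpreted when the search mode is set to Extended or Regular expression.

```python
# Hypothetical sample standing in for the comma-delimited file.
raw = "123+45678,123+45678_1,456n12345,789p00001,"

# Same idea as replacing "," with ",\n": keep the comma, start a new line after it.
one_per_line = raw.replace(",", ",\n")

# Split into lines to see how many values there are.
lines = one_per_line.splitlines()
print(len(lines), "lines")
```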

First, to weed out the dupes. Fortunately, each dupe ended with an underscore followed by a unique integer. Adding "_\d+" to the expression I had already built for the values matched every dupe, and replacing each match with nothing shortened my file to about 1,300 lines. Super.

\d{3}[-+np]\d{5}_\d+,\W
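Here is a rough Python sketch of the same dupe removal, run against made-up sample lines. One assumption on my part: the character class is meant to match only the literal characters +, -, n and p. Written as "[+-np]", the "+-n" part is read as a range from "+" to "n", which also lets digits and uppercase letters through, so the sketch puts the hyphen first ("[-+np]") and drops the redundant "{1}".

```python
import re

# Hypothetical sample: three real values and two dupes with "_<integer>" suffixes.
text = "123+45678,\n123+45678_1,\n456n12345,\n456n12345_27,\n789p00001,\n"

# Dupe pattern: value, underscore, integer, comma, then one non-word
# character (\W), which here consumes the line break after the comma.
dupe = re.compile(r"\d{3}[-+np]\d{5}_\d+,\W")

# Replacing each match with nothing removes the dupe lines entirely.
deduped = dupe.sub("", text)
print(deduped)

# The range pitfall: "[+-np]" accepts "A", while "[-+np]" does not.
range_class_matches_A = re.fullmatch(r"[+-np]", "A") is not None
literal_class_matches_A = re.fullmatch(r"[-+np]", "A") is not None
```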

Now all I needed to do was drop the comma, and surround the remaining values with <ID> and </ID> tags. Piece of cake, except I forgot to insert parentheses into the expression, and this caused me grief when trying to use "\1" in the “Replace with” box. Note to self: next time, don’t forget about the parentheses!

Find what: (\d{3}[-+np]\d{5})(,)

Replace with: <ID>\1</ID>

The comma maps to "\2", and I didn’t use that in my result expression, so the comma simply disappeared.
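The same capture-group trick can be sketched in Python, again on made-up sample values and with the character class written as "[-+np]" so the hyphen is a literal. Python's re module happens to use the same "\1" syntax for group references in the replacement string. The second half shows what Python does when the parentheses are missing; Notepad++ may behave differently in that case.

```python
import re

# Hypothetical de-duplicated input, one value per line with a trailing comma.
deduped = "123+45678,\n456n12345,\n789p00001,\n"

# Group 1 captures the value, group 2 captures the comma.
pattern = re.compile(r"(\d{3}[-+np]\d{5})(,)")

# Only \1 appears in the replacement, so the comma (\2) is dropped.
xml_lines = pattern.sub(r"<ID>\1</ID>", deduped)
print(xml_lines)

# Without parentheses there is no group 1 to refer to, and Python raises re.error.
missing_group_error = None
try:
    re.sub(r"\d{3}[-+np]\d{5},", r"<ID>\1</ID>", deduped)
except re.error as exc:
    missing_group_error = str(exc)
```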

The rest of the XML formatting, adding header and footer tags, was straightforward.

