regex

Grep duplicate JSON keys

If you keep application settings in large JSON files, each setting should appear only once. This is rarely a problem until the files grow very large and are edited by hand by all sorts of people.

{
"setting_1" : "some value",
"setting_2" : "another value",
"setting_1" : "different again"
}

Run a script to check for duplicate key names:

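grep -Po '"[a-z_0-9]+"(?=\s*:)' <filename> | sort | uniq -d

The above prints the duplicated setting name(s), if any, to the console. The sort is needed because uniq -d only reports repeated lines that are adjacent, and the lookahead keeps the spacing before the colon out of the comparison. Tested on Ubuntu 12.04; the -P option requires GNU grep.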
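
The grep approach treats the file as plain text, so it cannot tell a key repeated inside one object from the same key name used legitimately in two different objects. As a hedged alternative, here is a minimal Python sketch that lets the JSON parser report repeats per object. The function name find_duplicate_keys is my own, and the sketch assumes the file is otherwise valid JSON.

```python
import json
from collections import Counter


def find_duplicate_keys(text):
    """Return a sorted list of keys that occur more than once within any single JSON object."""
    duplicates = set()

    def check_pairs(pairs):
        # pairs holds every (key, value) of one object, repeats included
        counts = Counter(key for key, _ in pairs)
        duplicates.update(key for key, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(text, object_pairs_hook=check_pairs)
    return sorted(duplicates)


sample = '{"setting_1": "some value", "setting_2": "another value", "setting_1": "different again"}'
print(find_duplicate_keys(sample))  # ['setting_1']
```

To check a real file, read it with open(path).read() and pass the text in.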

Retrieve email using regex

This horrendous regular expression will search a string and return the first valid-looking email address in it.

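$email = "<'Freddy'> fred@live.com";
// The ' and / inside the pattern are escaped so the PHP string and the / delimiters stay intact; i makes it case-insensitive.
$pattern = '/[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/i';
if (preg_match($pattern, $email, $match)) {
    echo $match[0];
}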

This will return:

fred@live.com

Basically, if you pass a variable as the third parameter of the preg_match function, it will be filled with the results as an array, and the first item of the array will be the full matching string. If you use capturing groups, their contents follow in the later items. Read more about preg_match in the PHP manual.

I am told that this expression will match 99.9% of valid email addresses in the wild.
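
For comparison, the same idea can be sketched in Python. This is only an illustration under a couple of assumptions: the helper name extract_email is hypothetical, and the re.IGNORECASE flag is an addition so that upper-case letters match too.

```python
import re

# Same character classes as the PHP pattern; re.IGNORECASE is an addition.
EMAIL_RE = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
    re.IGNORECASE,
)


def extract_email(text):
    """Return the first email-like substring in text, or None if there is none."""
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None


print(extract_email("<'Freddy'> fred@live.com"))  # fred@live.com
print(extract_email("no address here"))           # None
```

Checking for a failed match before reading the result avoids an error when the string holds no address at all.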

Regex – “The” searching

Say you have a list of movie titles, and you want to either sort them, or search through them, and some of them have “The ” at the start, for example:

  • The Simpsons
  • Simpson Street

A MySQL search such as:

SELECT * FROM movies WHERE title LIKE "The Simp%";

would only return the first row. But if you are working in a company where there is no standard set, the movie title could just as easily be stored as “Simpsons, The”, and then it won’t be found.

To solve this, you could strip the “The ” from the search phrase, and strip it from the field contents during the query:

$str_query = preg_replace('/(title like "(the )(.*)%")/i',
    'REPLACE(LOWER(title), "the ", "") LIKE ("$3%")',
    $str_query);

This will change:

(title LIKE "The Simpsons%")

to:

(REPLACE(LOWER(title), "the ", "") LIKE ("Simpsons%"))

The (the ) group in the pattern tells PHP to rewrite the condition only if the search phrase starts with “The ” (case-insensitive, thanks to the i modifier).
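
To make the rewrite easier to follow, here is a rough Python sketch of the same substitution. The function name strip_leading_the is hypothetical, and the outer capturing group is dropped, so the rest of the phrase is group 2 here rather than group 3.

```python
import re

# Mirrors the PHP pattern: group 1 is the leading article, group 2 the rest of the phrase.
LEADING_THE = re.compile(r'title like "(the )(.*)%"', re.IGNORECASE)


def strip_leading_the(where_clause):
    """Rewrite a title LIKE "The ...%" condition so the leading article is ignored."""
    return LEADING_THE.sub(
        r'REPLACE(LOWER(title), "the ", "") LIKE ("\2%")',
        where_clause,
    )


print(strip_leading_the('(title LIKE "The Simp%")'))
# (REPLACE(LOWER(title), "the ", "") LIKE ("Simp%"))
print(strip_leading_the('(title LIKE "Simpson St%")'))
# (title LIKE "Simpson St%")
```

The second call comes back unchanged because the phrase has no leading “The ”.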

However, what if you want to search for just “the”? (Not sure why you would…)

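You need a negative lookahead, which tells the expression to carry on only if the search phrase is not exactly “the”:

if (preg_match('/title like "(?!the ?%")(.*)%"/i', $str_query)) {

The (?!the ?%") is the lookahead. A bare (?!the) would be too strict, because it rejects every phrase that starts with “the”, including “The Simp”.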

(.*) matches any string, but it is greedy, so you have to be careful that it doesn’t just swallow everything up to the end of $str_query. (It’s okay in our case, as long as the query holds a single LIKE condition, because the match has to end at %", the LIKE wildcard followed by the closing quote.)
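
Both points can be checked quickly in Python. This is a sketch, not the original PHP: the lookahead used here, (?!the ?%"), rejects only a phrase that is exactly “the” (with or without a trailing space), and the two-condition query string is made up to show where greedy matching goes wrong.

```python
import re

# Guard: proceed only when the phrase is something other than a bare "the" or "the ".
NOT_JUST_THE = re.compile(r'title like "(?!the ?%")(.*)%"', re.IGNORECASE)

print(bool(NOT_JUST_THE.search('title LIKE "The Simp%"')))  # True
print(bool(NOT_JUST_THE.search('title LIKE "the%"')))       # False

# Greedy (.*) runs to the LAST %" in the string; [^"]* cannot cross a closing quote.
query = 'title LIKE "The Simp%" OR title LIKE "The Fly%"'
greedy = re.search(r'title like "(.*)%"', query, re.IGNORECASE).group(1)
bounded = re.search(r'title like "([^"]*)%"', query, re.IGNORECASE).group(1)
print(greedy)   # The Simp%" OR title LIKE "The Fly
print(bounded)  # The Simp
```

Swapping (.*) for ([^"]*) is one way to keep the match inside a single quoted phrase if the query ever grows a second condition.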

After all this, we can run:

SELECT * FROM movies WHERE $str_query;

But what about sorting? All the titles beginning with “The” will appear in the T section, whereas really we want The Simpsons to appear in the S section.

Add an easy ORDER BY clause here, with the same LOWER() as before because MySQL's REPLACE() is case-sensitive:

SELECT * FROM movies WHERE $str_query ORDER BY REPLACE(LOWER(title), "the ", "") ASC;

Sorted!
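
If the titles are sorted in application code instead, the same idea can be expressed as a sort key. The sketch below is an assumption-laden illustration: sort_key is a hypothetical helper, the title list is made up, and unlike the SQL REPLACE() (which removes “the ” wherever it occurs in the title) it only strips a leading “The ” or a trailing “, The”.

```python
import re


def sort_key(title):
    """Lower-case the title and drop a leading "The " or a trailing ", The"."""
    key = title.strip().lower()
    key = re.sub(r"^the\s+", "", key)   # "the simpsons"  -> "simpsons"
    key = re.sub(r",\s*the$", "", key)  # "simpsons, the" -> "simpsons"
    return key


titles = ["The Simpsons", "Simpson Street", "Theatre of Blood", "Abyss, The"]
print(sorted(titles, key=sort_key))
# ['Abyss, The', 'Simpson Street', 'The Simpsons', 'Theatre of Blood']
```

Anchoring the pattern with ^ is what keeps a title like “Theatre of Blood” in the T section, since there is no space after its first three letters.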