Regex to find where consecutive lines end with the same text
Tonight's tip features a regular expression written by The fourth bird, a Stack Overflow user from the Netherlands. I had posted a question asking for a regular expression that finds a line whose text repeats the previous line, ignoring the time code at the start of each line, which may differ. See this example:
(11:12:21) [Tom]: Hello this is Tom. Who is it?
(11:14:08) [Tom]: Hello this is Tom. Who is it?
The goal was to find consecutive lines that are identical after the first 10 characters (the time code). The fourth bird came up with a solution that matches when the text after the time code is the same on two or more consecutive lines. In a text editor like Notepad++, run this find and replace with the search mode set to Regular expression:
FIND: ^(\([^][]*\))(.*)(?:\r?\n\([^][]*\)\2)+
REPLACE: $1$2
^(\([^][]*\)) finds the first part of the line: the time code in parentheses. The caret ^ anchors the match at the beginning of the line, \( and \) match the literal parentheses, and [^][]* matches any run of characters other than square brackets between them. The outer parentheses capture the time code as group 1.
(.*) captures everything from the end of the time code to the end of the line as group 2.
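(?:\r?\n opens a non-capturing group and matches the line break (with an optional carriage return) that leads into the next line.
\([^][]*\) matches the time code in parentheses at the start of that next line, using the same pattern as on the first line; the time code itself may differ.
\2)+ is a backreference that matches the same text that group 2 captured on the first line. The closing parenthesis ends the non-capturing group, and the + repeats it, so any number of consecutive repeated lines are matched.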
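To see what each group captures, here is a minimal sketch in Python rather than Notepad++. It assumes Python's re module handles this pattern the same way for this sample; the re.MULTILINE flag is needed there so that ^ matches at the start of every line. The variable names are made up for illustration.

```python
import re

# The FIND pattern from the tip, unchanged.
# re.MULTILINE lets ^ match at the start of each line, not only at the start of the text.
PATTERN = re.compile(r"^(\([^][]*\))(.*)(?:\r?\n\([^][]*\)\2)+", re.MULTILINE)

# The two example lines from the tip.
sample = (
    "(11:12:21) [Tom]: Hello this is Tom. Who is it?\n"
    "(11:14:08) [Tom]: Hello this is Tom. Who is it?\n"
)

match = PATTERN.search(sample)
time_code = match.group(1)     # group 1: time code of the first line
rest_of_line = match.group(2)  # group 2: text after the time code
whole_match = match.group(0)   # the first line plus every repeated line after it

print(repr(time_code))
print(repr(rest_of_line))
print(repr(whole_match))
```

Group 1 and group 2 together make up the first line, and the whole match also covers the repeated line that follows it.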
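As you can see in this demonstration, a find and replace in the text editor can easily remove the duplicate lines. The replacement $1$2 writes back only the time code and text of the first line, so the repeated lines that were part of the match are dropped.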
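The same find and replace can be sketched outside the editor. This is a rough Python equivalent, not the method from the original answer: Python writes the replacement $1$2 as \1\2, and the function name drop_repeated_lines and the sample chat log are made up for illustration.

```python
import re

# Same FIND pattern as in the tip; re.MULTILINE makes ^ match at every line start.
PATTERN = re.compile(r"^(\([^][]*\))(.*)(?:\r?\n\([^][]*\)\2)+", re.MULTILINE)


def drop_repeated_lines(text):
    """Keep the first line of each run of lines that repeat after the time code."""
    # \1\2 is Python's spelling of the editor replacement $1$2.
    return PATTERN.sub(r"\1\2", text)


# Hypothetical chat log: three repeated lines followed by a different one.
chat_log = (
    "(11:12:21) [Tom]: Hello this is Tom. Who is it?\n"
    "(11:14:08) [Tom]: Hello this is Tom. Who is it?\n"
    "(11:14:30) [Tom]: Hello this is Tom. Who is it?\n"
    "(11:15:00) [Ann]: It is Ann.\n"
)

cleaned = drop_repeated_lines(chat_log)
print(cleaned)
```

Only the first of the three repeated lines survives, with its original time code, and the line from the other speaker is left untouched.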
