Most common repeated strings

TalkPurely Programmers

Join LibraryThing to post.

Most common repeated strings

This topic is currently marked as "dormant"—the last message is more than 90 days old. You can revive it by posting a reply.

1timspalding
Jul 25, 2009, 10:19am

I'm looking for an easy way to identify the most common repeated strings in some code. I've got a big hunk of HTML and JavaScript with many repetitions of very long class, id and function names (eg., "LibraryThing.collections.toggleBookForCollectionFromMenu"). I want to simplify it down by making them smaller.

I'd love a script, preferably in PHP, that showed the longest such strings, so I can approach the issue systematically. I'm guessing this is an easy problem, and that it's baked into algorithms like ZIP. I'm sure there's an elegant algorithm for it.

Anyone have a good solution?

2timspalding
Jul 25, 2009, 1:08pm

I tried it over on StackOverflow. The solution is VERY slow. I've got my own, longer and more annoying one that's a little faster, mostly because it goes word by word.

http://stackoverflow.com/questions/1182208/find-longest-repeating-strings

3Sander314
Jul 25, 2009, 1:12pm

Sounds like a variant on http://en.wikipedia.org/wiki/Longest_repeated_substring_problem
Not very easy to implement in the most general case.

In your case however, you can use something easier as you probably want strings matching a-zA-z0-9\.+ and in that case some kind of brute force regex+hash solution will suffice. Do you have some example input?

4kevinmcguinness
Jul 27, 2009, 7:39pm

There is a solution to a similar problem in Jon Bentley's Programming Pearls, Column 15. He uses a data structure called a suffix array to create a very efficient solution for plain text. You might need to extract what you're interested in from the program text first, but Bentley's solution should work.

Column 15 of the book is available on the books website: http://www.cs.bell-labs.com/cm/cs/pearls/strings.html

-Kevin

5Amtep
Edited: Jul 28, 2009, 4:16pm

Hmm. I suspect that the normal unix tools are up for this job, unless you have gigabytes of code to analyze. We don't need no stinkin' elegance!

This pipeline will find the longest identifiers for you:

grep -oh '{a-zA-Z}{a-zA-Z0-9.}*' $FILES | sort | uniq | perl -n -e 'print length($_)-1, " ", $_' | sort -nr | head -10

(replace all braces with brackets to make it work. Is there any way to turn touchstones off?)

If you want to sort by the product of length and repetitions, in order to find the total bytes wasted on an identifier, you can use uniq -c and do a bit more magic in the perl script like this:

grep -oh '{a-zA-Z}{a-zA-Z0-9.}*' $FILES | sort | uniq -c | perl -na -e 'print $F{0} * length($F{1}), " $F{1}\\n"' | sort -nr | head -10

(again, replace all braces with brackets)

If you have too many files to list them all in $FILES, then grep -r is very nice.

6Sander314
Edited: Jul 28, 2009, 4:57pm

#5: oohhh, that's clever :)

I don't think there's any way to turn off touchstones, though this might work: [whatever]
[] -> [ ]
ETA: works. very awkward in editing though.

Or just link to a pastie: Ruby solution

7bnielsen
Jul 29, 2009, 6:06am

Yes, #5 is clever. I'd use sed to split on blanks and maybe add a test so only strings that occur more than once are included, but that's just bells and whistles.

8Amtep
Jul 29, 2009, 6:29am

uniq has an option for that too: add -d to only print lines that occur more than once. I really love uniq :) And especially the combination of sort | uniq -c | sort -n has come in handy many times.

Where would you use sed? I didn't need it because grep -o already prints each match on a separate line. I didn't know about grep -o before I looked it up for this task, so clue++ for me :)

9bnielsen
Jul 29, 2009, 8:11am

I'd use

sed -e 's/ /\n/g' $FILES | sort ...

rather than

grep -oh '{a-zA-Z}{a-zA-Z0-9.}*' $FILES | sort ...

so I'd get some slightly different strings, but that's just details. Hmm, uniq -d I didn't know about, so I have a few scripts with uniq -c | grep -v ' 1 '

clue++ for me too.

10Makis
Aug 3, 2009, 9:55am

I would personally get a programming IDE with refactoring capability to do this job. Eclipse seems to have a plugin for JavaScript refactoring:
http://www.hsr.ch/uploads/tx_icscrm/i_ba_2008_bachmann_pfister.pdf