| Meghavvarnam 2005-11-28, 2:52 am |
| Ed Morton wrote:
> Meghavvarnam wrote:
>
> <snip>
>
> Now we're back to about my original suggestion, if there are no newlines
> in the searched text:
>
> while IFS= read -r string
> do
>   grep -qF ">${string}<" directory/*.htm &&
>     echo "$string" >> usedStrings.txt
> done < allStrings.txt
>
> Alternatively, doing it all in awk, it's:
>
> gawk 'NR==FNR{strings[$0]++;next}
> { for (string in strings)
>     if (index($0,">"string"<")) {
>       usedStrings[string]++
>       delete strings[string] # for efficiency
>     }
> }
> END { for (string in usedStrings)
>     print string
> }' allStrings.txt directory/*.htm > usedStrings.txt
>
> Note that, since you said something in a previous posting about only
> wanting to look for text when it's part of an HTML tag (or something
> like that...) the search for ">"string"<" surrounds the line from
> "allStrings.txt" with ">" and "<" so it only matches when the text
> appears between those two characters. If you don't want that restriction,
> just get rid of the ">" and "<". Similarly for the grep solution.
>
This is the script that I tried -
# listused
# lists strings that are used in all .htm files
gawk 'NR==FNR{strings[$0]++;next} {
for (string in strings) #}
print string
if (index($0,">"string"<") || index($0,"\""string"\"")
|| index($0,">"string"\n")) {
usedStrings[string]++
delete strings[string] # for efficiency
}
}
END {
for (string in usedStrings)
print string
}' allStrings.txt htm/*.htm > usedStringsfile
Please let me know if there is any mistake in this. I gave execute
permission to the file containing this script and ran it from the
shell, but usedStringsfile was empty at the end of it.
Any pointers will be of great help.
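
One likely cause of the empty file, offered on the assumption that gawk stopped with a syntax error: awk accepts a line break after `||` but not before it, so the condition that is split across two lines does not parse. Two further points would matter once it does parse. Without braces the body of the `for` loop is only the `print`, so the `if` runs once per input line instead of once per string, and `$0` never contains the record's trailing newline, so `index($0, ">"string"\n")` cannot match. The sketch below changes those three things. The sample files, the `workdir` variable and the `endsWith` helper are made up for illustration, and plain `awk` is used so that it also runs where gawk is not installed.

```shell
#!/bin/sh
# Hypothetical sample data so the sketch can be run on its own.
workdir=$(mktemp -d)
mkdir "$workdir/htm"
printf '%s\n' 'Save file' 'Open file' 'Trailing text' 'Never used' \
    > "$workdir/allStrings.txt"
printf '%s\n' '<p>Save file</p>' '<a title="Open file">x</a>' \
    '<td>Trailing text' > "$workdir/htm/a.htm"

used=$(awk '
# Hypothetical helper: true when s ends with suffix.
function endsWith(s, suffix,    n) {
    n = length(s) - length(suffix)
    return n >= 0 && substr(s, n + 1) == suffix
}
NR == FNR { strings[$0]++; next }       # first file: remember every string
{
    for (string in strings) {           # braces: the test runs per string
        if (index($0, ">" string "<") ||
            index($0, "\"" string "\"") ||
            endsWith($0, ">" string)) {
            usedStrings[string]++
            delete strings[string]      # for efficiency
        }
    }
}
END { for (string in usedStrings) print string }
' "$workdir/allStrings.txt" "$workdir"/htm/*.htm | LC_ALL=C sort)

printf '%s\n' "$used"
rm -rf "$workdir"
```

The order in which `for (string in usedStrings)` visits the array is unspecified, which is why the output is piped through `sort` here.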
> If you'd like the awk script to tell you which strings are/aren't used,
> that's trivial, e.g.:
>
> gawk 'NR==FNR{strings[$0]++;next}
> { for (string in strings)
>     if (index($0,">"string"<")) {
>       usedStrings[string]++
>       delete strings[string] # for efficiency
>     }
> }
> END {
>   print "Used Strings:"
>   for (string in usedStrings)
>     printf "\t%s\n",string
>   print "Unused Strings:"
>   for (string in strings)
>     printf "\t%s\n",string
> }' allStrings.txt directory/*.htm
>
I modified the script above, and here is the version that I tried out:
gawk ' NR==FNR{strings[$0]++;next}
{ for (string1 in strings)
string = sprintf("<%s>", string1)
if (index($0,">"string"<")) {
usedStrings[string]++
delete strings[string] # for efficiency
}
}
END {
print "Used Strings:"
for (string in usedStrings)
printf "\t%s\n", string
print "Unused Strings:"
for (string in strings)
printf "\t%s\n", string
}' allStrings.txt htm/*.htm
I see the same behaviour with this as with the earlier script. Do we
need a different approach altogether?
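
A possible reason for the same behaviour, offered as a guess rather than a diagnosis: `string` is first set to `<string1>` and is then wrapped again in `">"` and `"<"`, so the text actually searched for is `><string1><`, which is unlikely to occur in the HTML. The `for` loop also has no braces, so only the `sprintf` line is repeated, and `delete strings[string]` uses a key that was never stored in the array. A minimal sketch that adds the delimiters once and uses the same key throughout follows; the sample files and the `workdir` variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data so the sketch can be run on its own.
workdir=$(mktemp -d)
mkdir "$workdir/htm"
printf '%s\n' 'Save file' 'Never used' > "$workdir/allStrings.txt"
printf '%s\n' '<p>Save file</p>' '<p>Something else</p>' > "$workdir/htm/a.htm"

report=$(awk '
NR == FNR { strings[$0]++; next }
{
    for (string in strings) {
        if (index($0, ">" string "<")) {   # delimiters are added here only
            usedStrings[string]++
            delete strings[string]         # same key that was iterated
        }
    }
}
END {
    print "Used Strings:"
    for (string in usedStrings) printf "\t%s\n", string
    print "Unused Strings:"
    for (string in strings) printf "\t%s\n", string
}
' "$workdir/allStrings.txt" "$workdir"/htm/*.htm)

printf '%s\n' "$report"
rm -rf "$workdir"
```

With this sample data "Save file" is listed as used and "Never used" as unused.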
What does the line NR==FNR{strings[$0]++;next} do?
Thank you in advance so much for your help.
Megh
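
On the question about `NR==FNR{strings[$0]++;next}`: `NR` counts the records read so far across all input files, while `FNR` counts the records read from the current file, so the two are equal only while the first file on the command line is being read. For those lines the action stores the line as a key of the `strings` array, and `next` skips the remaining rules. One caveat: if the first file is empty, the condition is also true for the second file. The small trace below shows the two counters; the file names and the `workdir` variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical two-file input to show how NR and FNR differ.
workdir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' > "$workdir/first.txt"
printf '%s\n' 'gamma' > "$workdir/second.txt"

trace=$(awk '
NR == FNR { print "first file: NR=" NR " FNR=" FNR " " $0; next }
          { print "later file: NR=" NR " FNR=" FNR " " $0 }
' "$workdir/first.txt" "$workdir/second.txt")

printf '%s\n' "$trace"
rm -rf "$workdir"
```

The line from the second file is reported with NR=3 and FNR=1, so the first rule no longer applies to it.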
> If there can be newlines in the strings you're trying to match in the
> HTML files, then we need to figure out what "match" means since there
> aren't newlines in the strings in "allStrings.txt" and we need to figure
> out a different record separator than a newline char.
>
> Ed.
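
One way the record separator idea could look, as a sketch only. It assumes that "match" means: the text between a ">" and the next "<", compared after every run of whitespace (including newlines) has been collapsed to a single space. The HTML files are read with `RS` set to `<`, so each record runs from one tag to the start of the next and may span several lines. The sample files and the `workdir` variable are hypothetical, and a ">" inside an attribute value or a script would confuse this simple split.

```shell
#!/bin/sh
# Hypothetical sample data: the text "Save the file" is split over two lines.
workdir=$(mktemp -d)
mkdir "$workdir/htm"
printf '%s\n' 'Save the file' 'Never used' > "$workdir/allStrings.txt"
printf '%s\n' '<p>Save the' '   file</p>' > "$workdir/htm/a.htm"

# RS is changed on the command line after allStrings.txt has been read,
# so the string list is still split on newlines.
used=$(awk '
NR == FNR { strings[$0]++; next }
{
    gt = index($0, ">")                 # end of the tag that opens the record
    if (gt == 0) next
    text = substr($0, gt + 1)           # everything up to the next "<"
    gsub(/[ \t\r\n]+/, " ", text)       # collapse whitespace runs
    sub(/^ /, "", text)
    sub(/ $/, "", text)
    if (text in strings) {
        print text
        delete strings[text]            # report each string once
    }
}
' "$workdir/allStrings.txt" RS='<' "$workdir"/htm/*.htm)

printf '%s\n' "$used"
rm -rf "$workdir"
```

Unlike the `index()` versions, this compares the whole text between two tags with the string, so it only reports exact matches after whitespace normalisation.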
|