Some will regard this as an enhancement request. To each his own, but IMO the *grep family has always had a huge deficiency when processing natural language: a match cannot span a line break. This hurts pdfgrep especially, because most PDF docs carry a payload of natural language.
If I need to search for “the.orange.menace” (dots are 1-char wildcards), of course I want to be told of cases like this:
A court whereby no one is above the law found the orange
menace guilty on 34 counts of fraud..
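To illustrate the behaviour being asked for, here is a minimal Python sketch (not anything pdfgrep does today). It assumes it is acceptable to collapse every run of whitespace, line breaks included, into a single space before matching; the function name `find_across_breaks` is made up for this example.

```python
import re


def find_across_breaks(pattern: str, text: str) -> list[str]:
    """Return all matches of `pattern` after flattening whitespace.

    Assumption: every run of whitespace (spaces, tabs, newlines) is
    equivalent to one space for the purposes of the search.
    """
    flat = re.sub(r"\s+", " ", text)
    return re.findall(pattern, flat)


sample = (
    "A court whereby no one is above the law found the orange\n"
    "menace guilty on 34 counts of fraud."
)

# Line-oriented matching misses the phrase: "." does not match "\n" by default.
line_matches = re.findall(r"the.orange.menace", sample)

# Matching against the flattened text finds it.
flat_matches = find_across_breaks(r"the.orange.menace", sample)

print(line_matches)  # []
print(flat_matches)  # ['the orange menace']
```

The trade-off is that the original line numbers are lost once the text is flattened, so a real tool would have to map match offsets back to lines for its output.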
When processing natural language, a sentence terminator is almost always a more sensible boundary than a newline. There’s probably no command older than grep that’s still in use today, so it’s bizarre that it has not evolved much. In the 90s there was a LexisNexis search tool which was far superior for natural language queries. E.g. (IIRC):
foo w/s bar
:: matches if “foo” appears within the same sentence as “bar”
foo w/4 bar
:: matches if “foo” appears within four words of “bar”
foo pre/5 bar
:: matches if “foo” appears before “bar”, within five words
:: matches if “foo” appears within the same paragraph as “bar”
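The operators above can be sketched in a few lines of Python. This is only a rough illustration under several assumptions, not how the LexisNexis tool worked: a sentence ends at ".", "!" or "?" followed by whitespace, a paragraph ends at a blank line, "within n words" means the word positions differ by at most n, and matching is case-insensitive on whole words. All function names are hypothetical.

```python
import re


def words(text: str) -> list[str]:
    """Lower-cased word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def sentences(text: str) -> list[str]:
    """Naive splitter: a sentence ends at . ! or ? followed by whitespace."""
    flat = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def paragraphs(text: str) -> list[str]:
    """Assumption: paragraphs are separated by one or more blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def within_sentence(a: str, b: str, text: str) -> bool:
    """foo w/s bar"""
    return any(a in w and b in w for w in map(words, sentences(text)))


def within_paragraph(a: str, b: str, text: str) -> bool:
    """foo w/p bar"""
    return any(a in w and b in w for w in map(words, paragraphs(text)))


def within_words(a: str, b: str, n: int, text: str, ordered: bool = False) -> bool:
    """foo w/n bar, or foo pre/n bar when ordered=True.

    Note: this version ignores sentence boundaries entirely.
    """
    w = words(text)
    pos_a = [i for i, x in enumerate(w) if x == a]
    pos_b = [i for i, x in enumerate(w) if x == b]
    for i in pos_a:
        for j in pos_b:
            d = j - i  # positive when a comes before b
            if (0 < d <= n) if ordered else (0 < abs(d) <= n):
                return True
    return False


sample = (
    "A court whereby no one is above the law found the orange\n"
    "menace guilty on 34 counts of fraud. The verdict was read aloud."
)

print(within_sentence("orange", "guilty", sample))                 # True
print(within_sentence("orange", "verdict", sample))                # False
print(within_paragraph("orange", "verdict", sample))               # True
print(within_words("orange", "guilty", 4, sample))                 # True
print(within_words("guilty", "orange", 5, sample, ordered=True))   # False
```

None of these checks care where the line breaks fall, which is the point: the record boundary is the sentence or paragraph, not the newline.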
Newlines as record separators are probably sensible for everything other than natural language. But for natural language, grep is a hack.
Hm… yeah, I didn’t know that; I just sort of assumed that it was for searching code etc initially, but you are correct.
BTW I just learned about pcregrep -M, which can do a little more directly what you’re asking for – you can do
pcregrep -M 'the(.|\n)orange(.|\n)menace'
which seems to work, although you may want -A or -B to give a little more useful output also.