Some will regard this as an enhancement request. To each his own, but IMO *grep has always had a huge deficiency when processing natural language, because a match cannot span a line break. pdfgrep especially, because most PDF docs carry a payload of natural language.
If I need to search for “the.orange.menace” (dots are 1-char wildcards), of course I want to be told of cases like this:
A court whereby no one is above the law found the orange
menace guilty on 34 counts of fraud.
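To make the wish concrete, here is a minimal Python sketch (not something grep or pdfgrep does today) of the behavior being asked for: the dot wildcard is allowed to match a newline, so the wrapped sentence above is found. The helper name `find_across_lines` is made up for illustration.

```python
import re


def find_across_lines(pattern: str, text: str) -> list[str]:
    """Return every match of `pattern`, letting '.' also match a newline."""
    # re.DOTALL makes '.' match any character, including '\n',
    # so a phrase wrapped across two lines is still found.
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.DOTALL)]


sample = (
    "A court whereby no one is above the law found the orange\n"
    "menace guilty on 34 counts of fraud.\n"
)

matches = find_across_lines(r"the.orange.menace", sample)
print(matches)
```

The match comes back with the embedded newline intact, so a real tool would still have to decide how to display a hit that straddles two lines.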
When processing a natural language, a sentence terminator is almost always a more sensible boundary. There’s probably no command older than grep that’s still in use today. So it’s bizarre that it has not evolved much. In the 90s there was a LexisNexis search tool which was far superior for natural language queries. E.g. (IIRC):
foo w/s bar
:: matches if “foo” appears within the same sentence as “bar”
foo w/4 bar
:: matches if “foo” appears within four words of “bar”
foo pre/5 bar
:: matches if “foo” appears before “bar”, within five words
foo w/p bar
:: matches if “foo” appears within the same paragraph as “bar”
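As a rough idea of what sentence-based matching could look like, here is a small Python sketch of two of those operators. It is only an approximation under stated assumptions: the function names are hypothetical, the sentence splitter is naive (it breaks after ".", "!" or "?" followed by whitespace, so abbreviations will fool it), and the exact counting rules of the original tool are not reproduced.

```python
import re


def sentences(text: str) -> list[str]:
    """Naive sentence splitter: line breaks are treated as plain spaces."""
    flat = " ".join(text.split())  # collapse newlines and runs of spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def within_sentence(text: str, a: str, b: str) -> list[str]:
    """Rough analogue of `a w/s b`: sentences containing both words."""
    hits = []
    for s in sentences(text):
        words = set(re.findall(r"\w+", s.lower()))
        if a.lower() in words and b.lower() in words:
            hits.append(s)
    return hits


def within_words(text: str, a: str, b: str, n: int) -> bool:
    """Rough analogue of `a w/n b`: word positions differ by at most n."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)


doc = "A court found the orange\nmenace guilty. The jury went home."
print(within_sentence(doc, "orange", "menace"))
print(within_words(doc, "court", "menace", 4))
```

Because the text is flattened before splitting, where the line breaks fall has no effect on the result, which is the whole point of using sentences rather than lines as records.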
Newlines as record separators are probably sensible for all things other than natural language. But for natural language grep is a hack.
My understanding of grep history is that Ken Thompson created grep to do some textual analysis on The Federalist Papers, which to me sounds like it was designed for processing natural language. But it was on a PDP-11, which had resource constraints. Lines of text would be more uniform to manage than sentences given the limited resources of the 1970s.
Thanks for the Perl code, though I might favor sed or awk for that job. Of course that also means complicating emacs’ grep mode facility. And for PDFs I guess I’d opt for pdfgrep’s limitations over doing a text extraction on every PDF.
Hm… yeah, I didn’t know that; I just sort of assumed that it was for searching code etc. initially, but you are correct.
BTW I just learned about
pcregrep -M
which can do a little more directly what you’re asking for – you can do
pcregrep -M 'the(.|\n)orange(.|\n)menace'
which seems to work, although you may want -A or -B to give a little more useful output also.