Bug reports on any software


When a bug tracker is inside the exclusive walled gardens of MS GitHub or GitLab.com, and you cannot or will not enter, where do you file your bug report? Here, of course. This is a refuge where you can report bugs that are otherwise unreportable due to technical or ethical constraints.

⚠ Of course, there is no guarantee that a report will be seen by anyone relevant. Hopefully some kind souls will volunteer to proxy the reports.

founded 3 years ago

MODERATORS

[email protected]

grep/pdfgrep’s inability to match across lines (sopuli.xyz)

submitted 3 months ago* (last edited 3 months ago) by [email protected] to c/[email protected]


Some will regard this as an enhancement request. To each his own, but IMO *grep has always had a huge deficiency when processing natural languages, because matches cannot span line breaks. pdfgrep suffers especially, because most PDF docs carry a payload of natural language.

If I need to search for “the.orange.menace” (dots are 1-char wildcards), of course I want to be told of cases like this:

A court whereby no one is above the law found the orange  
menace guilty on 34 counts of fraud.

When processing a natural language, a sentence terminator is almost always a more sensible boundary than a line break. There’s probably no command older than grep that’s still in use today, so it’s bizarre that it has not evolved much. In the 90s there was a LexisNexis search tool which was far superior for natural language queries. E.g. (IIRC):

foo w/s bar :: matches if “foo” appears within the same sentence as “bar”
foo w/4 bar :: matches if “foo” appears within four words of “bar”
foo pre/5 bar :: matches if “foo” appears before “bar”, within five words
foo w/p bar :: matches if “foo” appears within the same paragraph as “bar”
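The four operators above could be approximated roughly as follows. This is only a sketch of how I read them, not how LexisNexis implemented them: the function names, the tokenizer, the naive sentence splitter and the second sample sentence are all assumptions made for illustration.

```python
import re

def words(text):
    # Lowercased word tokens; punctuation and line breaks are ignored.
    return re.findall(r"[a-z0-9']+", text.lower())

def within(text, a, b, n, ordered=False):
    # True if word a occurs within n words of word b.
    # With ordered=True, a must also come before b (the pre/n case).
    toks = words(text)
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    for i in pos_a:
        for j in pos_b:
            d = j - i
            if (0 < d <= n) if ordered else (0 < abs(d) <= n):
                return True
    return False

def same_unit(text, a, b, splitter):
    # True if a and b share one chunk produced by the splitter regex.
    for chunk in re.split(splitter, text):
        toks = words(chunk)
        if a in toks and b in toks:
            return True
    return False

SENTENCE = r"(?<=[.!?])\s+"   # naive: also splits after abbreviations like "e.g."
PARAGRAPH = r"\n\s*\n"        # assumes a blank line separates paragraphs

# Sample text; the second sentence is made up for the demonstration.
text = ("A court whereby no one is above the law found the orange\n"
        "menace guilty on 34 counts of fraud. The verdict was unanimous.")

print(within(text, "orange", "guilty", 4))                # like: orange w/4 guilty
print(within(text, "guilty", "orange", 5, ordered=True))  # like: guilty pre/5 orange
print(same_unit(text, "court", "fraud", SENTENCE))        # like: court w/s fraud
print(same_unit(text, "court", "verdict", PARAGRAPH))     # like: court w/p verdict
```

Because the tokenizer throws away line breaks before counting words, the line wrap between “orange” and “menace” has no effect on any of these checks.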

Newlines as record separators are probably sensible for all things other than natural language. But for natural language grep is a hack.
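To make the idea concrete, here is a minimal sketch of what a sentence-as-record search might look like. The name sentence_grep and the sentence-splitting rule are assumptions for illustration only; a real tool would need a much better splitter (abbreviations such as “e.g.” will fool this one).

```python
import re

def sentence_grep(text, pattern):
    # Collapse line breaks and runs of whitespace into single spaces,
    # so a wrapped line no longer hides a match.
    flat = " ".join(text.split())
    # Naive sentence boundary: ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    rx = re.compile(pattern)
    # Report whole sentences, the way grep reports whole lines.
    return [s for s in sentences if rx.search(s)]

sample = ("A court whereby no one is above the law found the orange  \n"
          "menace guilty on 34 counts of fraud. Nothing else matched.")

for hit in sentence_grep(sample, r"the.orange.menace"):
    print(hit)
```

The record that comes back is the full sentence rather than a physical line, which is the behaviour being asked for here.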


[–] [email protected] 2 points 3 months ago (1 children)

grep isn't really designed as a natural language search tool but perl -pe can do a pretty similar thing to what you're looking for.

perl -0777 -pe 's/\n/ /g' file.txt | perl -ne 'print "$1\n" while /(.{0,20}(the.orange.menace).{0,20})/g'
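For anyone who would rather not chain two perl invocations, the same idea can be sketched in Python. This is an approximate port, not an exact equivalent; the name grep_flat and the context parameter are made up for this example.

```python
import re

def grep_flat(text, pattern, context=20):
    # Step 1: same as s/\n/ /g, every newline becomes one space.
    flat = text.replace("\n", " ")
    # Step 2: capture up to `context` characters on either side of each match.
    rx = re.compile(r".{0,%d}(?:%s).{0,%d}" % (context, pattern, context))
    return [m.group(0) for m in rx.finditer(flat)]

text = "found the orange\nmenace guilty"
for hit in grep_flat(text, r"the.orange.menace"):
    print(hit)
```

One caveat that applies to both versions: if a line ends in trailing spaces before the break, replacing the newline with a space leaves several spaces in a row and the single-character wildcard no longer matches. Collapsing whitespace with " ".join(text.split()) instead of the plain replace would avoid that.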

[–] [email protected] 1 points 3 months ago (1 children)

grep isn’t really designed as a natural language search tool

My understanding of grep’s history is that Ken Thompson created grep to do some textual analysis on The Federalist Papers, which to me sounds like it was designed for processing natural language. But it was on a PDP-11, which had resource constraints. Lines of text would have been more uniform to manage than sentences, given the limited resources of the 1970s.

Thanks for the Perl code, though I might favor sed or awk for that job. Of course, that also means complicating emacs’ grep mode facility. And for PDFs I guess I’d opt for living with pdfgrep’s limitations over doing a text extraction on every PDF.

[–] [email protected] 2 points 3 months ago

Hm... yeah, I didn't know that; I just sort of assumed that it was for searching code etc. initially, but you are correct.

BTW I just learned about pcregrep -M which can do a little more directly what you're asking for -- you can do pcregrep -M 'the(.|\n)orange(.|\n)menace' which seems to work, although you may want -A or -B to give a little more useful output also.