JW

Jim Wilson

20/02/2004 8:18 PM

nfilter users

Here is yet another improvement.

rec.woodworking Drop subject:.*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*

This version eliminates messages with YUK in the subject,
regardless of whether any non-alphabetic junk is inserted
between the letters. You'll need additional similar lines in
your filter file, one for each different objectionable word.

Regular expressions must be enabled. To do this, go to
Edit-Configuration, and check the "Enable regular expressions"
box on the General tab.

Jim


This topic has 26 replies

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 3:59 AM

In article <MPG.1aa06dddfb07e7899897e9@localhost>, Jim Wilson <[email protected]> wrote:
>
>I'm sorry for this error. There is apparently a shortcoming in nfilter's
>regular expression interpreter. The regular expression [^A-Za-z] means
>"any character that is not a letter."

That's not the problem -- the problem is that "[Yy]*" matches Y, y, or
*nothing*.
>
>> >That filter is *grossly* defective. the '*' modifier character means
>> >"match ZERO OR MORE" of the preceding token.
>
>Well, you are right that it does not work correctly with nfilter,

Missing the point. The point is that it's an incorrect regexp. Period. Whether
it works with nfilter or not is irrelevant.

[snip]
>
>> >As written, your filter will catch *anything*.
>> > "(zero or more 'y') (zero or more 'anything') (zero or more 'u')
>> > (zero or more 'anything') (zero or more 'k')"
>> >which reduces to 'match anything'.
>>
>> Well, not quite. There's no asterisk after the K, so it reduces to 'match
>> anything containing a K'.
>
>Actually, both of these interpretations overlook the leading caret, which
>is the negation modifier for a list (or class, for you PERL users). The
>fact that it doesn't work in nfilter doesn't make it an incorrect regular
>expression. Impractical in this case, though, to be sure.

I'm not overlooking that at all, as my posts in this thread have made
abundantly clear.

It doesn't work in nfilter precisely because it IS an incorrect regexp. As I
posted earlier, you must either (a) omit the asterisk, or (b) replace it with
a plus sign, for it to do what you want. Regexps so constructed work.

I tested mine.
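
The two repairs described above can be tried outside nfilter. The following is a minimal sketch using Python's `re` module, on the assumption that nfilter's matcher treats `*`, `+` and bracket lists the same way; the exact expressions in Doug's own filter are not shown in this thread, and the subject lines are made up for illustration.

```python
import re

# Option (a): omit the asterisk after the letter classes.
OMIT_STAR = re.compile(r"[Yy][^A-Za-z]*[Uu][^A-Za-z]*[Kk]")
# Option (b): replace the asterisk with a plus sign.
USE_PLUS = re.compile(r"[Yy]+[^A-Za-z]*[Uu]+[^A-Za-z]*[Kk]")


def is_dropped(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


# Hypothetical subject lines, not taken from the thread.
for subject in ["Y-U-K!", "yyuuk", "Oak table", "Firetruck plans"]:
    print(subject, is_dropped(OMIT_STAR, subject), is_dropped(USE_PLUS, subject))
```

Both versions leave "Oak table" and "Firetruck plans" alone. Only the plus-sign version also catches repeated letters such as "yyuuk".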


--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

bR

[email protected] (Robert Bonomi)

in reply to Jim Wilson on 20/02/2004 8:18 PM

22/02/2004 2:25 AM

In article <[email protected]>,
Doug Miller <[email protected]> wrote:
>In article <[email protected]>,
>[email protected] (Robert Bonomi) wrote:
>[snip]
>
>>The construct "([Yy]|[^A-Za-z])*" results in a lot of 'thrashing' inside
>>the code that does matching. "[^A-XZa-xz]*" is _literally_ orders of
>>magnitude more efficient.
>
>Elegant.
>
>Just curious, Robert ... how do the efficiencies of these two expressions
>compare?
>
>[Yy][^A-XZa-xz]*
>[Yy]+[^A-Za-z]*


The first one is noticeably faster, although not grossly so.

In the first case, the '[Yy]' drops out of consideration as soon as
a 'y' is encountered. Only a 'set membership' test is required on the
subsequent characters.

In the second instance, after the initial 'y' is encountered, a subsequent
character _could_ be (1) a continuation of the 1st token (another 'y'), or it
could be (2) the initial part of the next token ("not a letter"), or it could
(3) be a 'not match'. You _have_ to make *both* tests (1) and (2) on each
character following every 'y' -- because there could be 'something else'
following the "not a letter" test that would invalidate one presumption but
not the other.

Contemplate something (nonsensical!) like the regex "[Yy]+[yY]+" and the
string "xxYYYyyzz".

Yes, the regex _does_ generate a match on that string -- starting at
the character immediately after the 2nd 'x'.

Now the mighty question rises, _which_ of the characters between the 'x'
and 'z' match with *which* of the two tokens in the regex?

The first one clearly matches the first "[]+" token.

And since we _do_ have a match (you can trust me on that :), one of the
four subsequent letters must match the 2nd "[]+" token. But _which_ one?
<evil grin>

The "internal processing" of regular expression match checking is *MESSY*!!
And a hell of a lot more complicated than it initially appears.
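
One way to see how a particular engine divides the run of Y's between the two tokens is to wrap each token in a capture group. The sketch below uses Python's `re` module, which is a backtracking engine with greedy quantifiers; nfilter's matcher is not described in this thread and may divide the characters differently.

```python
import re

# Same expression as above, with each "[]+" token in its own group.
match = re.search(r"([Yy]+)([yY]+)", "xxYYYyyzz")

print(match.start())   # index where the overall match begins
print(match.group(0))  # the whole matched run
print(match.group(1))  # characters claimed by the first token
print(match.group(2))  # characters claimed by the second token
```

In this engine the first token initially takes all five characters, the second token then finds nothing left to match, and the engine hands back one character. The split therefore ends up as four characters for the first token and one for the second, and it is only settled after that backtracking step.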

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 3:02 PM

In article <[email protected]>, [email protected] (Robert Bonomi) wrote:
[snip]

>The construct "([Yy]|[^A-Za-z])*" results in a lot of 'thrashing' inside
>the code that does matching. "[^A-XZa-xz]*" is _literally_ orders of
>magnitude more efficient.

Elegant.

Just curious, Robert ... how do the efficiencies of these two expressions
compare?

[Yy][^A-XZa-xz]*
[Yy]+[^A-Za-z]*

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

bR

[email protected] (Robert Bonomi)

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 11:27 AM

In article <MPG.1aa09990766445aa9897fd@localhost>,
Jim Wilson <[email protected]> wrote:
>Doug Miller wrote...
>>
>> That's not the problem -- the problem is that "[Yy]*" matches Y, y, or
>> *nothing*.
>> ...
>> Missing the point.
>
>You're absolutely right. I did miss the point, and it wouldn't work as
>intended for the very reason you were trying to point out. I compounded
>my error with a mis-diagnosis of the problem and then still further by
>misunderstanding what you were trying to say. Pretty laughable, even to
>me. All I can say is, egg on my face.

Now you begin to understand why I was "encouraging" you not to publish.

<grin>

"Regular expression" construction -- once you get much beyond the 'trivial'
cases -- is a *very* arcane art. Unfortunately, there are many, *many*, ways
to accomplish complex matches -- frequently with the more 'obvious' ways
being far, *far*, less efficient than some less-obvious alternatives. I'm
not talking about a factor of 2-3, but sometimes 1000x or more. I've seen
a _one-line_ regexp (about 60 characters) that would require _centuries_ on
a fast computer to find a match (*if* it existed) in one line (of roughly
100 characters) of text. That particular "exercise in futility" is a
pathological case, unlikely to occur in the real world -- it was designed
to showcase the potential problems in the design of the regular expression
"language".

>Has this sort of cascading blunderfest ever happened in the shop? Nah.
>(G)
>
>Anyway, I finally have a filter that does what I want, nothing more and
>nothing less, and I have tested it extensively -- live, using nfilter in
>the alt.test newsgroup. It doesn't look like yours or Robert's, and it
>takes care of some conditions not covered by the other filter expressions
>I've seen so far here. Although we haven't seen all these "tricks" from
>our local trolls, I *have* seen them in spam email, so I figure it is
>only a matter of time before these variants start showing up here.
>Because of that, I am going to paste an example below with the caveat
>that it's from a guy who's been screwing up royally on this subject all
>day long. Take it or leave it.
>
>Filter for "YUCK" or some variant of it in the subject:
>
>rec.woodworking Drop subject:[Yy]([Yy]|[^A-Za-z])*[Uu]([Uu]|[^A-Za-z])*[Cc]([Cc]|[^A-Za-z])*[Kk]
>
>Note that it should be only one line of text.

Yeah, that does 'pretty close' to what you say. Of course, it trips on
that 4-letter sequence 'in the middle of' another word -- like "Pennsyucky".

Also, while it _does_ work, it is a _seriously_inefficient_ expression.

The construct "([Yy]|[^A-Za-z])*" results in a lot of 'thrashing' inside
the code that does matching. "[^A-XZa-xz]*" is _literally_ orders of
magnitude more efficient. Admittedly, you have to tailor the 'not' list
depending on what the preceding character was. Also, characters that really
are doubled in the 'targeted' word require special consideration.
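
To make the tailoring concrete, here is a sketch of the whole "YUCK" filter rewritten with one negated list per letter, checked against the alternation version in Python's `re` module. The negated lists for U and C are my own extension of the `[^A-XZa-xz]` idea, the subject lines are hypothetical, and the sketch only checks that the two forms agree; it does not measure speed.

```python
import re

# The alternation form posted earlier in the thread, joined onto one line.
ALTERNATION = re.compile(
    r"[Yy]([Yy]|[^A-Za-z])*[Uu]([Uu]|[^A-Za-z])*[Cc]([Cc]|[^A-Za-z])*[Kk]"
)
# Each negated list excludes every letter except the one just matched.
NEGATED = re.compile(
    r"[Yy][^A-XZa-xz]*[Uu][^A-TV-Za-tv-z]*[Cc][^ABD-Zabd-z]*[Kk]"
)


def compare(subjects):
    """Return (subject, alternation_hit, negated_hit) for each subject."""
    return [
        (s, ALTERNATION.search(s) is not None, NEGATED.search(s) is not None)
        for s in subjects
    ]


# Hypothetical subjects, plus the "Pennsyucky" case mentioned above.
results = compare(["Y*U*C*K", "yyuucckk", "Pennsyucky", "Lucky day", "Oak table"])
for row in results:
    print(row)
```

Both forms hit the first three subjects, including the unwanted "Pennsyucky", and both leave the last two alone.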

> Note also that the initial
>and final ".*" everyone's been using are not present as they are
>unnecessary.

Yup, you're ABSOLUTELY RIGHT about that. I actually use "something
completely different" for my filtering, which gets _entire_ header lines
to check against. Thus, when I want to check _only_ the "Subject: " line
my regexp _has_ to start with "^Subject:.*".

The 'cost' of ".*" is -- for all practical purposes -- _ZERO_ when it
occurs at the beginning of an expression. The only difference is _where_
the 'start of match' occurs in the string being checked. Without the leading
".*", the match starts at the first character of the targeted word/phrase.
But with the ".*", the match begins at the beginning of the line. If you
use the 'results of the match' for other processing (which you cannot do
in nFilter/NewsProxy, but *can* do in other types of software) the difference
can be important.

At the end of a regexp, the 'cost' of ".*" is small, but it is utterly wasted
effort. The one that's even funnier is a trailing ".*$" -- `match zero or
more of anything, up to end-of-line'. *exactly* the same functionality as
a trailing ".*", _but_ introduces the additional overhead of an explicit
check after _every_ character, to see if end-of-line has been reached, or not.
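
The difference in where the match starts and ends can be seen directly in software that exposes match positions. A small sketch in Python's `re` module follows; the subject text and the `[Yy]uck` expression are made up for the illustration, and no timing is measured.

```python
import re

subject = "what a yuck day"  # hypothetical subject text

bare = re.search(r"[Yy]uck", subject)
leading = re.search(r".*[Yy]uck", subject)
trailing = re.search(r"[Yy]uck.*", subject)
trailing_anchored = re.search(r"[Yy]uck.*$", subject)

# All four find a match; only the reported positions differ.
print(bare.span())               # starts at the word itself
print(leading.span())            # starts at the beginning of the line
print(trailing.span())           # runs on to the end of the line
print(trailing_anchored.span())  # same span as the plain trailing ".*"
```

For a filter that only asks "did it match?", all four expressions give the same answer.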

Anyway, lazy me, I just strip the '^' from the beginning of the regexp I use,
and export to an nFilter rule set.


sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

22/02/2004 3:32 AM

In article <[email protected]>, [email protected] (Robert Bonomi) wrote:
>In article <[email protected]>,
>Doug Miller <[email protected]> wrote:
>>In article <[email protected]>,
>>[email protected] (Robert Bonomi) wrote:
>>[snip]
>>
>>>The construct "([Yy]|[^A-Za-z])*" results in a lot of 'thrashing' inside
>>>the code that does matching. "[^A-XZa-xz]*" is _literally_ orders of
>>>magnitude more efficient.
>>
>>Elegant.
>>
>>Just curious, Robert ... how do the efficiencies of these two expressions
>>compare?
>>
>>[Yy][^A-XZa-xz]*
>>[Yy]+[^A-Za-z]*
>
>
>The first one is noticeably faster, although not grossly so.

Thanks.
[snip explanation]

>Contemplate something (nonsensical!) like the regex "[Yy]+[yY]+" and the
>string "xxYYYyyzz".
>
>Yes, the regex _does_ generate a match on that string -- starting at
>the character immediately after the 2nd 'x'.
>
>Now the mighty question rises, _which_ of the characters between the 'x'
>and 'z' match with *which* of the two tokens in the regex?
>
>The first one clearly matches the first "[]+" token,
>
>And since we _do_ have a match (you can trust me on that :), one of the
>four subsequent letters must match the 2nd "[]+" token. But _which_ one?
><evil grin>

I'm gonna guess that it's the second lower-case y, as the character
immediately following that is not a match.

Is there a prize if I'm right?

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 3:22 PM

In article <MPG.1aa09a2fcdbbe78a9897fe@localhost>, Jim Wilson <[email protected]> wrote:
>Thomas Kendrick wrote...
>> Jim,
>> Thanks for trying to help the group develop useful filters.
>
>Tom,
>
>Thanks for your kind words of encouragement. Robert and Doug were
>absolutely right, though, and I don't blame them for ripping me a new one
>for my errors. I would have done the same if I could have reached around
>me that far. (G)
>
Jim,

I'm sorry my tone was harsh; I didn't mean to be "ripping you a new one". It's
important to bear in mind that, although a good number of the readers of this
ng are computer-savvy (including more than a few professional programmers),
there are many more who are not. And therefore, for their benefit, we must
make sure that any code that we post is as close to perfect as we can make it.

You've just bumped into the single hardest facet of software testing. Almost
any fool can verify that a piece of software does what it is intended to do.
It's infinitely more difficult to make sure that it does _not_ do what it is
_not_ intended to do. :-)
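
One way to cover both halves of that test before going live is to run each candidate expression against two lists: subjects that must be dropped and subjects that must survive. The sketch below does this in Python's `re` module, assuming its regular expression behaviour is close enough to nfilter's for a first pass; `check_filter` is a hypothetical helper and the subject lines are invented.

```python
import re


def check_filter(pattern, must_drop, must_keep):
    """Return (missed, wrongly_dropped) for a candidate filter expression."""
    rx = re.compile(pattern)
    missed = [s for s in must_drop if not rx.search(s)]
    wrongly_dropped = [s for s in must_keep if rx.search(s)]
    return missed, wrongly_dropped


# The expression from the first post in this thread.
ORIGINAL = r".*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*"

missed, wrongly_dropped = check_filter(
    ORIGINAL,
    must_drop=["YUK", "y.u.k", "Y U K"],
    must_keep=["Workbench plans", "Oak table finish", "Walnut bowl"],
)
print("missed:", missed)
print("wrongly dropped:", wrongly_dropped)
```

The first list comes back empty, which is the result a "does it kill the junk?" test reports. The second list is where the defect shows up.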

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

WB

"Wood Butcher"

in reply to Jim Wilson on 20/02/2004 8:18 PM

22/02/2004 10:47 PM

Thanks Doug.

I've found that Stop/Start occasionally will not load the new *.dat file
and this bit me earlier last week.

Art

"Doug Miller" <[email protected]> wrote in message
news:[email protected]...
> In article <bzPZb.362176$I06.3793815@attbi_s01>, "Wood Butcher" <[email protected]>
wrote:
> >"Doug Miller" wrote in message
> >> You've just bumped into the single hardest facet of software testing. Almost
> >> any fool can verify that a piece of software does what it is intended to do.
> >> It's infinitely more difficult to make sure that it does _not_ do what it is
> >> _not_ intended to do. :-)
> >
> >
> >How do *you* test your nfilter rules?
> >This is what I've been doing and if there is a better way I'd
> >sure appreciate knowing what it is.
> >
> >First close nfilter & restart it to load the new nfilter.dat file.
>
> Stop/Start is sufficient.
> >
> >Send test msgs to alt.test & see what happens.
>
> I use some of the groups under the alt.test hierarchy that have less traffic,
> which makes it easier and faster to find my test posts. Try alt.test.d .
> >[being sure to substitute alt.test for rec.woodworking in the rules
> >for this test.]
> >
> >In an alternate account (in OE) subscribe to the wreck and download
> >all message headers (about 12K at this time) and see what leaked thru.
>
> I'm using NewsXpress, which uses a file called NEWSRC to track groups and
> articles. It automatically saves a copy at startup; when I shut it down, I
> reload NEWSRC from the saved copy before restarting, so that it will scan the
> same set of articles again.
>
> >This tests for all the actual spam msgs to the wreck.
> >In nfilters "Dropped Articles" window, copy everything into an excel
> >worksheet. Manipulate the data to sort on the source in the
> >message-id field (i.e. what's after the @) and see what rules killed
> >those sources. The overwhelming bulk of the kills are for vulgarity and
> >3 or more x-posts so I figure those are good kills.
>
> I don't care much about that -- I'm far more concerned with the posts that
> should be killed but get through anyway.
> >
> >For the remaining few sources that could be bad kills how does
> >one xref the msg-id to a subject line?
> >
> I have configured a second instance of NewsXpress that connects directly to
> the newsserver instead of to Nfilter. If I need to spot the kills by subject,
> I compare what I see in the filtered and unfiltered versions, then look at the
> headers on the posts that don't match.
>
> --
> Regards,
> Doug Miller
>
> For a copy of my TrollFilter for NewsProxy/Nfilter,
> email me at filterinfo-at-milmac-dot-com
>
>

JW

Jim Wilson

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 2:18 AM

Doug Miller wrote...
> In article <[email protected]>,
> [email protected] (Robert Bonomi) wrote:
> >In article <MPG.1aa01987d119b9b59897e5@localhost>,
> >Jim Wilson <[email protected]> wrote:
> >>Here is yet another improvement.
> >>
> >>rec.woodworking Drop subject:.*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*
> >>
> >>This version eliminates messages with YUK in the subject,
> >>regardless of whether any non-alphabetic junk is inserted
> >>between the letters. You'll need additional similar lines in
> >>your filter file, one for each different objectionable word.
> >

I'm sorry for this error. There is apparently a shortcoming in nfilter's
regular expression interpreter. The regular expression [^A-Za-z] means
"any character that is not a letter."

> >That filter is *grossly* defective. the '*' modifier character means
> >"match ZERO OR MORE" of the preceding token.

Well, you are right that it does not work correctly with nfilter, which
is the important point, of course, and I did fail to detect that before
posting. When I tested it, it *did* effectively filter the posts that I
wanted it to. However, I didn't make sure that it left the good stuff
alone. For that, I do apologize. It's a beginner's programming error for
which there is no excuse.

> >As written, your filter will catch *anything*.
> > "(zero or more 'y') (zero or more 'anything') (zero or more 'u')
> > (zero or more 'anything') (zero or more 'k')"
> >which reduces to 'match anything'.
>
> Well, not quite. There's no asterisk after the K, so it reduces to 'match
> anything containing a K'.

Actually, both of these interpretations overlook the leading caret, which
is the negation modifier for a list (or class, for you PERL users). The
fact that it doesn't work in nfilter doesn't make it an incorrect regular
expression. Impractical in this case, though, to be sure.

> >Please *don't* publish things when you "don't know what you're doing".

Well, this is a rather strong admonishment. I don't mean to get uppity, but
programming regular expressions is something that I do know a thing or
two about, having been using them "regularly" for a couple of decades as
a professional software developer. I admit to poorly testing this example
on this implementation, as well as my slash typo in an earlier message.
If these two careless mistakes are severe enough to warrant banning
further posts from me on the subject, then so be it, but I think that's
going a little far.

Cheers!

Jim

JW

Jim Wilson

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 5:24 AM

Doug Miller wrote...
>
> That's not the problem -- the problem is that "[Yy]*" matches Y, y, or
> *nothing*.
> ...
> Missing the point.

You're absolutely right. I did miss the point, and it wouldn't work as
intended for the very reason you were trying to point out. I compounded
my error with a mis-diagnosis of the problem and then still further by
misunderstanding what you were trying to say. Pretty laughable, even to
me. All I can say is, egg on my face.

Has this sort of cascading blunderfest ever happened in the shop? Nah.
(G)

Anyway, I finally have a filter that does what I want, nothing more and
nothing less, and I have tested it extensively -- live, using nfilter in
the alt.test newsgroup. It doesn't look like yours or Robert's, and it
takes care of some conditions not covered by the other filter expressions
I've seen so far here. Although we haven't seen all these "tricks" from
our local trolls, I *have* seen them in spam email, so I figure it is
only a matter of time before these variants start showing up here.
Because of that, I am going to paste an example below with the caveat
that it's from a guy who's been screwing up royally on this subject all
day long. Take it or leave it.

Filter for "YUCK" or some variant of it in the subject:

rec.woodworking Drop subject:[Yy]([Yy]|[^A-Za-z])*[Uu]([Uu]|[^A-Za-z])*[Cc]([Cc]|[^A-Za-z])*[Kk]

Note that it should be only one line of text. Note also that the initial
and final ".*" everyone's been using are not present as they are
unnecessary.

Cheers!

Jim

JW

Jim Wilson

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 5:27 AM

Thomas Kendrick wrote...

> Jim,
> Thanks for trying to help the group develop useful filters.

Tom,

Thanks for your kind words of encouragement. Robert and Doug were
absolutely right, though, and I don't blame them for ripping me a new one
for my errors. I would have done the same if I could have reached around
me that far. (G)

Cheers!

Jim

JW

Jim Wilson

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 9:52 PM

Doug Miller wrote...
> Just curious, Robert ... how do the efficiencies of these two expressions
> compare?
>
> [Yy][^A-XZa-xz]*
> [Yy]+[^A-Za-z]*

These aren't equivalent expressions, so an efficiency comparison doesn't
seem pertinent.

Jim

JW

Jim Wilson

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 9:53 PM

Doug Miller wrote...
>
> I'm sorry my tone was harsh;

Hey, no sweat. My skin is thick, and I had it coming.

> You've just bumped into the single hardest facet of software testing.

Not the first time with over 20 years in the field. (G)

Jim

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 1:10 AM

In article <[email protected]>, [email protected] (Robert Bonomi) wrote:
>In article <MPG.1aa01987d119b9b59897e5@localhost>,
>Jim Wilson <[email protected]> wrote:
>>Here is yet another improvement.
>>
>>rec.woodworking Drop subject:.*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*
>>
>>This version eliminates messages with YUK in the subject,
>>regardless of whether any non-alphabetic junk is inserted
>>between the letters. You'll need additional similar lines in
>>your filter file, one for each different objectionable word.
>
>
>Please *don't* publish things when you "don't know what you're doing".
>
>That filter is *grossly* defective. the '*' modifier character means
>"match ZERO OR MORE" of the preceding token.
>
>As written, your filter will catch *anything*.
> "(zero or more 'y') (zero or more 'anything') (zero or more 'u') (zero or
> more 'anything') (zero or more 'k')"
>which reduces to 'match anything'.

Well, not quite. There's no asterisk after the K, so it reduces to 'match
anything containing a K'.
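
To see which token ends up matching what, each piece of the expression can be put in its own capture group. This is a sketch in Python's `re` module, on the assumption that nfilter reads `*` the same way; the subjects are invented.

```python
import re

# The posted expression with every token wrapped in a group.
BROKEN_WITH_GROUPS = re.compile(
    r"(.*)([Yy]*)([^A-Za-z]*)([Uu]*)([^A-Za-z]*)([Kk])(.*)"
)

hit = BROKEN_WITH_GROUPS.match("Oak table")
print(hit.groups())

# A subject with no K anywhere is the only kind that survives.
print(BROKEN_WITH_GROUPS.match("Walnut bowl"))
```

For "Oak table" the four starred tokens in the middle all match the empty string. Only the `[Kk]` does any real work, and the leading and trailing ".*" absorb the rest of the line.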

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

bR

[email protected] (Robert Bonomi)

in reply to Jim Wilson on 20/02/2004 8:18 PM

22/02/2004 11:14 AM

In article <[email protected]>,
Doug Miller <[email protected]> wrote:
>In article <[email protected]>,
>[email protected] (Robert Bonomi) wrote:
>>In article <[email protected]>,
>>Doug Miller <[email protected]> wrote:
>>>In article <[email protected]>,
>>>[email protected] (Robert Bonomi) wrote:
>>>[snip]
>>>
>>>>The construct "([Yy]|[^A-Za-z])*" results in a lot of 'thrashing' inside
>>>>the code that does matching. "[^A-XZa-xz]*" is _literally_ orders of
>>>>magnitude more efficient.
>>>
>>>Elegant.
>>>
>>>Just curious, Robert ... how do the efficiencies of these two expressions
>>>compare?
>>>
>>>[Yy][^A-XZa-xz]*
>>>[Yy]+[^A-Za-z]*
>>
>>
>>The first one is noticeably faster, although not grossly so.
>
>Thanks.
>[snip explanation]
>
>>Contemplate something (nonsensical!) like the regex "[Yy]+[yY]+" and the
>>string "xxYYYyyzz".
>>
>>Yes, the regex _does_ generate a match on that string -- starting at
>>the character immediately after the 2nd 'x'.
>>
>>Now the mighty question rises, _which_ of the characters between the 'x'
>>and 'z' match with *which* of the two tokens in the regex?
>>
>>The first one clearly matches the first "[]+" token,
>>
>>And since we _do_ have a match (you can trust me on that :), one of the
>>four subsequent letters must match the 2nd "[]+" token. But _which_ one?
>><evil grin>
>
>I'm gonna guess that it's the second lower-case y, as the character
>immediately following that is not a match.

Yes, the overall match covers all the 'y' characters, regardless of case.
The question I meant to pose is "which token matches _which_ characters?"

Is it a one-character match for the first token, and four for the second,
or two and three, or three and two, or four and one?

For extra credit, _when_ is that determination made, and on what basis?

>
>Is there a prize if I'm right?

If you're into musical instruments, I've got a spare set of hardware for
an air guitar I could liberate. You'll have to pick up the shipping charges
though.

WB

"Wood Butcher"

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 8:53 PM

"Doug Miller" wrote in message
> You've just bumped into the single hardest facet of software testing. Almost
> any fool can verify that a piece of software does what it is intended to do.
> It's infinitely more difficult to make sure that it does _not_ do what it is
> _not_ intended to do. :-)


How do *you* test your nfilter rules?
This is what I've been doing and if there is a better way I'd
sure appreciate knowing what it is.

First close nfilter & restart it to load the new nfilter.dat file.

Send test msgs to alt.test & see what happens.
[being sure to substitute alt.test for rec.woodworking in the rules
for this test.]

In an alternate account (in OE) subscribe to the wreck and download
all message headers (about 12K at this time) and see what leaked thru.
This tests for all the actual spam msgs to the wreck.
In nfilter's "Dropped Articles" window, copy everything into an Excel
worksheet. Manipulate the data to sort on the source in the
message-id field (i.e. what's after the @) and see what rules killed
those sources. The overwhelming bulk of the kills are for vulgarity and
3 or more x-posts so I figure those are good kills.
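
The sort-by-source step could also be done with a short script instead of a worksheet. The sketch below assumes the dropped-article list has already been turned into (message-id, rule) pairs; the layout of nfilter's "Dropped Articles" window is not shown in this thread, and the message-ids and rule names here are placeholders.

```python
from collections import defaultdict


def kills_by_source(dropped):
    """Group rule names by the part of each message-id after the '@'."""
    by_source = defaultdict(set)
    for message_id, rule in dropped:
        # Keep what follows the last "@" and strip the closing ">".
        source = message_id.rsplit("@", 1)[-1].rstrip(">").lower()
        by_source[source].add(rule)
    return {source: sorted(rules) for source, rules in sorted(by_source.items())}


# Placeholder data for illustration only.
sample = [
    ("<abc1@example.net>", "vulgarity"),
    ("<abc2@example.net>", "crosspost"),
    ("<xyz9@example.org>", "vulgarity"),
]
summary = kills_by_source(sample)
for source, rules in summary.items():
    print(source, rules)
```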

For the remaining few sources that could be bad kills how does
one xref the msg-id to a subject line?

Art

bR

[email protected] (Robert Bonomi)

in reply to Jim Wilson on 20/02/2004 8:18 PM

21/02/2004 1:02 AM


I've updated the filter rules at: <http://www.r-bonomi.com/rec.woodworking>

They're broken out into several separate files, for:
hosts/domains/IP-addresses the troll posts from
inappropriate and/or excessive cross-posting
vulgarities
politics

Simply download the rule-sets you want to use, append together to create a
single file, and install.
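
For anyone who would rather script the append step, here is a minimal Python sketch. The rule-set file names in the usage comment are placeholders, not the actual names on the web page, and `build_filter_file` is a hypothetical helper.

```python
from pathlib import Path


def build_filter_file(rule_files, output):
    """Append the given rule-set files, in order, into one filter file."""
    parts = []
    for name in rule_files:
        text = Path(name).read_text()
        # Make sure the last rule of one file does not run into the next file.
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    Path(output).write_text("".join(parts))
    return output


# Example call, with placeholder file names:
# build_filter_file(["hosts.txt", "vulgarities.txt"], "nfilter.dat")
```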

Note: The main web-page now shows the date/time that each filter set was
last modified.

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

20/02/2004 8:58 PM

In article <[email protected]>, Gordon Airport <[email protected]> wrote:
>
>Trying to keep up with a subject filter arms-race seems like a bad idea.
>It's too easy to make a slight change and get more crap through,
>especially if you're posting the filters. It looks to me like filtering
>on "nym.alias.net" in the header might be the way to go. There may be
>some false positives, but I don't think many people here are using
>remailers (or wherever that string comes from.)
>

That would be the best -- for those of us whose newsservers supply the Path or
Organization headers when Nfilter asks for them. Mine doesn't, unfortunately,
so I have to filter by subject keywords or by Message-ID. Fortunately, the
latter works pretty well until this bozo gets kicked off and finds another
ISP.

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

bR

[email protected] (Robert Bonomi)

in reply to Jim Wilson on 20/02/2004 8:18 PM

20/02/2004 11:41 PM

In article <MPG.1aa01987d119b9b59897e5@localhost>,
Jim Wilson <[email protected]> wrote:
>Here is yet another improvement.
>
>rec.woodworking Drop subject:.*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*
>
>This version eliminates messages with YUK in the subject,
>regardless of whether any non-alphabetic junk is inserted
>between the letters. You'll need additional similar lines in
>your filter file, one for each different objectionable word.


Please *don't* publish things when you "don't know what you're doing".

That filter is *grossly* defective. The '*' modifier character means
"match ZERO OR MORE" of the preceding token.

As written, your filter will catch *anything*.
"(zero or more 'y') (zero or more 'anything') (zero or more 'u') (zero or more 'anything') (zero or more 'k')"
which reduces to 'match anything'.

Yes, it will get rid of the troll posts, at the price of getting rid of
_everything_else_! This qualifies as a "BAD IDEA(TM)".

I've expanded/updated the filter files at
<http://www.r-bonomi.com/rec.woodworking/>
so one can apply filters in 'groups', by simply appending individual
files together.

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

22/02/2004 1:13 AM

In article <MPG.1aa18116422f3d6e9897ff@localhost>, Jim Wilson <[email protected]> wrote:
>Doug Miller wrote...
>> Just curious, Robert ... how do the efficiencies of these two expressions
>> compare?
>>
>> [Yy][^A-XZa-xz]*
>> [Yy]+[^A-Za-z]*
>
>These aren't equivalent expressions, so an efficiency comparison doesn't
>seem pertinent.
>
I realize they're not equivalent, but for the purposes to which we're putting
them, they might as well be.

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

wD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

23/02/2004 2:41 AM

In article <9ka_b.101630$jk2.468963@attbi_s53>, "Wood Butcher" <[email protected]> wrote:
>Thanks Doug.
>
>I've found that Stop/Start occasionally will not load the new *.dat file
>and this bit me earlier last week.
>
I haven't seen that -- however, I have, several times, had to shut down and
restart the *client* when simply disconnecting and reconnecting didn't work.
Go figure...

--
Regards,
Doug Miller (alphageek-at-milmac-dot-com)

For a copy of my TrollFilter for NewsProxy/Nfilter,
send email to autoresponder at filterinfo-at-milmac-dot-com

TK

Thomas Kendrick

in reply to Jim Wilson on 20/02/2004 8:18 PM

20/02/2004 9:25 PM

On Sat, 21 Feb 2004 02:18:32 GMT, Jim Wilson <[email protected]>
wrote:

<snip of comment>

Jim,
Thanks for trying to help the group develop useful filters.

Newsproxy DOES actually produce a suppression log which I have used to
see which rule deleted a message. My take is that if I copy the
results of your labors, it will be my responsibility to validate that
it works for me, either via understanding the filter logic or testing
it myself. All that I would assume is that you have given best effort
to test the code in a few situations. Certainly expecting a full
system regression test would be excessive.
Thanks,
Tom

>If these two careless mistakes are severe enough to warrant banning
>further posts from me on the subject, then so be it, but I think that's
>going a little far.
>
>Cheers!
>
>Jim

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

20/02/2004 10:42 PM

In article <[email protected]>, alexy <[email protected]> wrote:
>Jim Wilson <[email protected]> wrote:
>
>>Here is yet another improvement.

As I noted in another post in this thread, it's not.
>>
>>rec.woodworking Drop subject:.*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*
>>
>Actually, wouldn't that get anything with the letters y, u, and k in
>that order? You really don't want to miss messages about firetrucks!
><g>

Perhaps you missed the ^ character? That *negates* what follows; thus,
"[^A-Za-z]*" means "any sequence of zero or more consecutive NON-alphabetic
characters".

However, this filter will in fact kill "firetruck" as a subject, but that's
because it kills anything with a K in it anywhere. See my other post in this
thread for a full description.

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

22/02/2004 1:11 AM

In article <bzPZb.362176$I06.3793815@attbi_s01>, "Wood Butcher" <[email protected]> wrote:
>"Doug Miller" wrote in message
>> You've just bumped into the single hardest facet of software testing. Almost
>> any fool can verify that a piece of software does what it is intended to do.
>> It's infinitely more difficult to make sure that it does _not_ do what it is
>> _not_ intended to do. :-)
>
>
>How do *you* test your nfilter rules?
>This is what I've been doing and if there is a better way I'd
>sure appreciate knowing what it is.
>
>First close nfilter & restart it to load the new nfilter.dat file.

Stop/Start is sufficient.
>
>Send test msgs to alt.test & see what happens.

I use some of the groups under the alt.test hierarchy that have less traffic,
which makes it easier and faster to find my test posts. Try alt.test.d .
>[being sure to substitute alt.test for rec.woodworking in the rules
>for this test.]
>
>In an alternate account (in OE) subscribe to the wreck and download
>all message headers (about 12K at this time) and see what leaked thru.

I'm using NewsXpress, which uses a file called NEWSRC to track groups and
articles. It automatically saves a copy at startup; when I shut it down, I
reload NEWSRC from the saved copy before restarting, so that it will scan the
same set of articles again.

>This tests for all the actual spam msgs to the wreck.
>In nfilter's "Dropped Articles" window, copy everything into an Excel
>worksheet. Manipulate the data to sort on the source in the
>message-id field (i.e. what's after the @) and see what rules killed
>those sources. The overwhelming bulk of the kills are for vulgarity and
>3 or more x-posts so I figure those are good kills.

I don't care much about that -- I'm far more concerned with the posts that
should be killed but get through anyway.
>
>For the remaining few sources that could be bad kills how does
>one xref the msg-id to a subject line?
>
I have configured a second instance of NewsXpress that connects directly to
the newsserver instead of to Nfilter. If I need to spot the kills by subject,
I compare what I see in the filtered and unfiltered versions, then look at the
headers on the posts that don't match.
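
The comparison described above amounts to a set difference keyed on message-id. Here is a minimal Python sketch of the idea, assuming you have somehow exported both views; the function name, the dictionary layout and the sample message-ids are all hypothetical, and neither NewsXpress nor Nfilter is known to produce data in this form.

```python
def find_killed_subjects(unfiltered, filtered_ids):
    """Return {message_id: subject} for articles absent from the filtered view.

    unfiltered   -- dict mapping message-id to subject, from the direct connection
    filtered_ids -- set of message-ids that came through the filter
    """
    return {
        message_id: subject
        for message_id, subject in unfiltered.items()
        if message_id not in filtered_ids
    }

# Hypothetical sample data, for illustration only.
unfiltered = {
    "<1@example.invalid>": "Y*U*K",
    "<2@example.invalid>": "Kickback question",
    "<3@example.invalid>": "Walnut finish",
}
filtered_ids = {"<3@example.invalid>"}

killed = find_killed_subjects(unfiltered, filtered_ids)
for message_id, subject in sorted(killed.items()):
    print(message_id, subject)
```

Anything listed in the result was dropped by the filter, so a subject that looks legitimate there points to a bad kill.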

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

aa

alexy

in reply to Jim Wilson on 20/02/2004 8:18 PM

20/02/2004 9:32 PM

Jim Wilson <[email protected]> wrote:

>Here is yet another improvement.
>
>rec.woodworking Drop subject:.*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*
>
>This version eliminates messages with YUK in the subject,
>regardless of whether any non-alphabetic junk is inserted
>between the letters. You'll need additional similar lines in
>your filter file, one for each different objectionable word.
>
>Regular expressions must be enabled. To do this, go to
>Edit-Configuration, and check the "Enable regular expressions"
>box on the General tab.
>
>Jim

Actually, wouldn't that get anything with the letters y, u, and k in
that order? You really don't want to miss messages about firetrucks!
<g>
--
Alex
Make the obvious change in the return address to reply by email.

sD

[email protected] (Doug Miller)

in reply to Jim Wilson on 20/02/2004 8:18 PM

20/02/2004 10:38 PM

In article <MPG.1aa01987d119b9b59897e5@localhost>, Jim Wilson <[email protected]> wrote:
>Here is yet another improvement.

Not in my opinion. It does not appear that you tested this very carefully, if
at all. Look at the test posts I put in alt.test.cztery. Then run this filter
on them and see which ones it drops.
>
>rec.woodworking Drop subject:.*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*
>
>This version eliminates messages with YUK in the subject,

It does a lot more than that -- it eliminates all messages with the letter K
anywhere in the subject.

Examine it piece by piece:

.*         = any string of ZERO or more characters
[Yy]*      = ZERO or more upper or lower case Y
[^A-Za-z]* = ZERO or more non-alphabetic characters
[Uu]*      = ZERO or more upper or lower case U
[^A-Za-z]* = ZERO or more non-alphabetic characters
[Kk]       = exactly ONE upper or lower case K
.*         = any string of ZERO or more characters

Thus, any string that has an upper or lower case K anywhere in it will be
removed by this filter. Probably not what you want.
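
The breakdown above can be checked with a short Python sketch. It uses Python's re module as a stand-in for nfilter's regex engine, which may differ in details; the function name and the sample subjects are invented for the test.

```python
import re

# The expression from the posted filter line, minus the "subject:" prefix.
ORIGINAL = re.compile(r".*[Yy]*[^A-Za-z]*[Uu]*[^A-Za-z]*[Kk].*")

def dropped_by_original(subject):
    """True if the posted expression matches the subject line."""
    return ORIGINAL.search(subject) is not None

subjects = [
    "Y-U-K",                      # the intended target
    "firetruck",                  # contains a K
    "Kickback on the table saw",  # contains a K
    "Walnut finish",              # no K anywhere
]
for subject in subjects:
    print(subject, "->", dropped_by_original(subject))
```

Only the subject without a K survives, because every starred piece is allowed to match nothing.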

Either remove the asterisks after the characters of the objectionable word(s)

.*[Yy][^A-Za-z]*[Uu][^A-Za-z]*[Kk].*

or replace them with plus signs (match ONE or more)

.*[Yy]+[^A-Za-z]*[Uu]+[^A-Za-z]*[Kk]+.*

and it will work a *lot* better.
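
Here is a rough Python comparison of the two corrected forms. Python's re module is again only a stand-in for nfilter's engine, and the constant names, function name and sample subjects are made up for illustration.

```python
import re

# Asterisks after the letters removed: exactly one Y, one U, one K.
NO_STAR = re.compile(r".*[Yy][^A-Za-z]*[Uu][^A-Za-z]*[Kk].*")
# Asterisks replaced with plus signs: one or more of each letter.
PLUS = re.compile(r".*[Yy]+[^A-Za-z]*[Uu]+[^A-Za-z]*[Kk]+.*")

def dropped(pattern, subject):
    """True if the compiled pattern matches the subject line."""
    return pattern.search(subject) is not None

subjects = [
    "YUK",
    "Y*U*K",
    "y u k",
    "YYUUKK",                     # repeated letters
    "firetruck",                  # no Y, so it should pass
    "Kickback on the table saw",  # K but no Y or U in sequence
    "Yukon trip report",          # "yuk" embedded in a longer word
]
for subject in subjects:
    print(subject, "->", dropped(NO_STAR, subject), dropped(PLUS, subject))
```

The two forms differ only on repeated letters such as "YYUUKK", which the plus-sign version catches and the other does not. Both still match a word like "Yukon", since nothing in either expression requires the letters to stand alone.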

--
Regards,
Doug Miller

For a copy of my TrollFilter for NewsProxy/Nfilter,
email me at filterinfo-at-milmac-dot-com

GA

Gordon Airport

in reply to Jim Wilson on 20/02/2004 8:18 PM

20/02/2004 3:39 PM


Trying to keep up with a subject filter arms-race seems like a bad idea.
It's too easy to make a slight change and get more crap through,
especially if you're posting the filters. It looks to me like filtering
on "nym.alias.net" in the header might be the way to go. There may be
some false positives, but I don't think many people here are using
remailers (or wherever that string comes from.)

- Doug

