Cleanfeed - Regular Expressions

Overview

Apart from the EMP-type filters, Regular Expressions form the foundation of every other type of filter in Cleanfeed. You don't need to understand them fully in order to run Cleanfeed, but at least a basic knowledge is a real advantage if you wish to customise its behaviour.
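
As a minimal sketch of the kind of pattern involved, the Perl below tests a few newsgroup names against a single regex. The pattern, the helper name group_matches and the group names are invented for illustration; none of them are Cleanfeed defaults.

```perl
use strict;
use warnings;

# Hypothetical pattern: any group in the alt.binaries hierarchy,
# or any group whose final component is "test".
my $pattern = qr/^alt\.binaries\.|\.test$/;

# Return 1 if a single newsgroup name matches the pattern, else 0.
sub group_matches {
    my ($group) = @_;
    return $group =~ $pattern ? 1 : 0;
}

for my $group (qw(alt.binaries.pictures misc.test comp.lang.perl.misc)) {
    printf "%-22s %s\n", $group, group_matches($group) ? 'match' : 'no match';
}
```

The dots are escaped because an unescaped dot matches any character, and the ^ and $ anchors tie each alternative to the start or end of the group name. Without them the pattern would match anywhere inside the name.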

Regular Expression Resources

There are dozens of good Regular Expression resources on the World Wide Web, and a web search is a good way to locate them. The official Perl manual page is a good reference but isn't really intended as a tutorial for beginners. The official Perl tutorial is excellent, as are many of the other hits your search will uncover.

Regex Details

The following table doesn't include the defaults for the regexes, as many of them are very long indeed. I'd recommend browsing through the Cleanfeed Perl file and checking them there. If you think something is missing from the defaults that would improve the filter for other users, please email me.
Parameter Description
bin_allowed Which groups are exempt from binary filtering. To put it another way, these are the groups where binaries are acceptable.
image_allowed Whilst images are technically binaries, they are accepted in a broader range of groups than other types of binaries. This option defines groups in addition to bin_allowed where images are allowed.
image_extensions Files with these extensions are accepted in image_allowed groups.
bad_bin Binaries are not allowed in groups matching this, even if they are defined in bin_allowed. This lets bin_allowed define a broad hierarchy of groups, with specific groups within it then excluded by this option.
md5exclude Groups matching this regex are excluded from the EMP MD5 check. Where an article is crossposted, all the groups in the distribution must match md5exclude in order to be excluded.
poison_groups Reject all crossposts to these groups. If a message is posted to more than one group and any of the groups match this regex, it will be rejected. Use with caution; it's a potent filter.
allexclude If all the groups in a distribution match this regex, the message will be excluded from all filters.
score_exclude Exclude groups matching this from the scoring filter.
html_allowed If block_html is turned on, HTML-formatted messages will be accepted in groups matching this regex. Where an article is crossposted, all the groups must match.
mime_html_allowed If block_mime_html is turned on, groups matching this will be excluded from the filter. Where an article is crossposted, all the groups must match. By default, no groups are defined as accepting MIME HTML.
test_groups Usenet has hundreds of test groups and the role of this regex is to match all of them. Test groups are excluded from all the EMP filters, except the PHR filter where groups must be explicitly defined.
low_xpost_groups Messages where the distribution contains a group matching this regex will be rejected if the number of groups exceeds low_xpost_maxgroups.
meow_groups Messages where the distribution contains a group matching this regex will be rejected if the number of non-meow groups in the distribution exceeds meow_ext_maxgroups.
no_cancel_groups Cancel messages to these groups will be rejected.
fsl_exclude Groups matching this regex will be excluded from the FSL EMP Filter. Where an article is crossposted, all the groups must match.
phl_exclude See fsl_exclude.
phn_exclude See fsl_exclude.
phl_exempt Posts originating from these hosts will be excluded from the PHL EMP Filter.
phn_exempt See phl_exempt.
phr_exempt See phl_exempt.
flood_groups Messages posted (or crossposted) to a group matching this regex will be subject to the PHR EMP Filter. By default, no groups are defined as high-risk; it's up to the operator to identify them when a flood occurs. Usually this happens by way of abuse complaints, as the operator can't watch tens of thousands of newsgroups.
supersedes_exempt Messages originating from a host matching this regex will be exempt from the Supersedes Filter.
refuse_messageids Messages with a Message-ID matching this regex will be rejected.
spam_report_groups Some groups are dedicated to dealing with spam. We want to exclude them from filtering as the content there is likely to be spam-like.
adult_groups Groups matching this regex are treated as Adult Groups, unless the match is negated by a subsequent match in not_adult_groups.
not_adult_groups Groups matching this will be treated as non-adult groups, even if they match adult_groups.
faq_groups Distributions where all the groups match this regex are granted a much higher acceptable level of supersedes than normal groups. This is because FAQ postings often supersede previous versions of the same FAQ. This parameter is ignored unless do_supersedes_filter is enabled.
shorturl This is a list of URLs that offer URL shortening or redirection services. These are frequently used to obfuscate spam URLs. The regex is checked within the scoring section of Cleanfeed.
bad_nph_hosts This regex matches against the NNTP-Posting-Host header. When a match occurs, the posting host will not be used to seed the NPH or PHR filters. In these instances, if phr_aggressive or phn_aggressive is True, the right-most FQDN in the Path header will be used instead. This protects against floods from services (such as newsguy.com) that place unlinkable data in the NNTP-Posting-Host header.
topic1_groups By default this regex is empty. Groups that match it will trigger topic filters that work in the same way as meow_groups.
Example: The operator may elect to limit crossposting from adult content groups to non-adult groups. This could be done by defining:
topic1_groups => '\.sex'
This would limit crossposts between groups matching .sex and those that don't.
See also off_topic_maxgroups and on_topic_mingroups.
topic2_groups Identical to topic1_groups. This parameter allows for the definition of a second list of topic groups.
Restricted_Groups Unlike all the other regexes, this one is held in a Perl hash, keyed by a friendly name with the regex as the value. Distributions with a group that matches one of these regexes will be rejected if they also include a group that doesn't match it.
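
Several rows in the table rely on one of three crosspost rules: every group in the distribution must match (md5exclude, allexclude, html_allowed and others), any one group matching is enough (poison_groups), or matching and non-matching groups may not be mixed (Restricted_Groups). The Perl below is a rough sketch of those three rules as described above. It is not Cleanfeed's own code, and the helper names, patterns and hash key are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical patterns, for illustration only (not Cleanfeed defaults).
my $md5exclude    = qr/^local\./;
my $poison_groups = qr/^alt\.example\.flood$/;
my %Restricted_Groups = (
    example_hierarchy => qr/^example\./,   # friendly name => regex
);

# "All groups must match": true only if every group matches the regex.
sub all_match {
    my ($regex, @groups) = @_;
    return 0 unless @groups;
    for my $group (@groups) {
        return 0 unless $group =~ $regex;
    }
    return 1;
}

# "Any group matches": true if at least one group matches the regex.
sub any_match {
    my ($regex, @groups) = @_;
    for my $group (@groups) {
        return 1 if $group =~ $regex;
    }
    return 0;
}

# Restricted rule: return the friendly name of the first entry (in sorted
# key order) whose groups are mixed with groups outside it, or undef if
# no entry is violated.
sub restricted_violation {
    my ($restricted, @groups) = @_;
    for my $name (sort keys %$restricted) {
        my $regex = $restricted->{$name};
        return $name
            if any_match($regex, @groups) && !all_match($regex, @groups);
    }
    return;
}

# Split a Newsgroups-style value into its groups and apply each rule.
my @groups = split /,/, 'example.announce,misc.test';
print "excluded from MD5 check\n" if all_match($md5exclude, @groups);
print "poisoned crosspost\n"
    if @groups > 1 && any_match($poison_groups, @groups);
my $hit = restricted_violation(\%Restricted_Groups, @groups);
print "restricted crosspost: $hit\n" if defined $hit;
```

Keeping the all-match and any-match tests as separate helpers makes the difference between an exclusion (which needs every group to match) and a rejection (which needs only one) easy to see when reading the table.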