Web

Spam taking up most of the Internet bandwidth?

Is spam taking up a lot of both network bandwidth and CPU capacity on the Internet?

Spam is not just email; it also comes in another form: spam comments on websites. A large number of machines seem engineered to attack any website that allows user comments. aczoom.com is a low-volume personal site, so it was a shock to see the bandwidth go up from the usual 2-3 GB per month to over 9 GB per month. Investigating this led to the conclusion that it is mostly comment spam activity, with most of it coming from machines in China.

[Image: Spam Load Stats]

The attached image shows the huge increase in spam processing. The extra bandwidth alone might not have been a problem, but the spam also causes a huge increase in the system load. Over this same period, IP addresses from China accounted for 67% of total traffic, or over 6 GB of network traffic. I very seriously doubt people in China have any interest in any part of this aczoom.com site.

Here's a table that shows data from the AWStats and spam logging programs:

Website Activity                             Before Nov 2012   As of Nov 2012   Change due to Spam
Bandwidth Used                               2.2 GB/month      9.4 GB/month     4x
Spam Comments Attempted (average)            500 / day         5,000 / day      10x
Spam Comments Attempted (peak)               1,000 / day       10,000 / day     10x
Bandwidth used by IP addresses from China    0.3 GB/month      6.3 GB/month     20x
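
As a quick sanity check, the rounded change factors in the table and the 67% share quoted earlier can be reproduced from the raw figures. This is only a small arithmetic sketch; the dictionary keys and function names are made up for illustration, and the numbers are copied from the table.

```python
# Figures copied from the table above (GB/month and comments/day).
before = {"bandwidth_gb": 2.2, "spam_avg": 500, "spam_peak": 1000, "china_gb": 0.3}
after = {"bandwidth_gb": 9.4, "spam_avg": 5000, "spam_peak": 10000, "china_gb": 6.3}


def change_factor(old, new):
    """How many times larger the new value is than the old one."""
    return new / old


def china_share(china_gb, total_gb):
    """Fraction of total bandwidth attributed to IP addresses from China."""
    return china_gb / total_gb


def summary():
    """Return the change factor per metric plus the China share as a percentage."""
    factors = {key: change_factor(before[key], after[key]) for key in before}
    share_pct = 100 * china_share(after["china_gb"], after["bandwidth_gb"])
    return factors, share_pct
```

The table rounds these factors to the nearest convenient figure, which is why bandwidth shows as 4x and the China traffic as 20x.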

Google Chromebook tips

[Updated after a week - Chromebook is actually a pain to use, not yet ready for prime time. Fine if you are always online, even then, user experience is not smooth. There are just too many bugs - which means that they will be fixed in due time, but the existence of such basic problems makes it hard to recommend Chrome to everyone at this time. Worse - its offline mode is buggy - I lost hours of work. Read on.]

Having played with the new Google Chromebook for a week now, it is a great device! Well, so I thought after one day of use. After a week, ran into too many bothersome issues, some are listed below. I've played with both the 2012 devices: Samsung Chromebook (US$249) and Acer Chromebook (US$199).

The Samsung device looks sleeker, boots faster (10 seconds), and needs no internal fan. The Acer looks a bit clunkier, but its CPU is slightly faster (20% in some web tests), and it has a huge 320 GB hard drive. Full reviews are available on the web as well as on YouTube (web search, YouTube search), and it is worth reading through a few to get some tips on how to use this device well.

Computerized directions can be completely wrong

This is a cautionary tale about depending on getting directions from a web site.
This example is using Google Maps, but I suspect such problems lie with all the systems.

I know Montreal pretty well, so when a person looking lost on the street showed me the Google Maps directions, I was astonished to see that they pointed in completely the opposite direction from the desired destination.
The printout claimed to provide walking directions from Metro Station Place-d'Armes to The Quays Skating Rink, Ville-Marie, Montreal. Now The Quays are in Old Montreal, which is south of the metro.

Google Maps gets that right when you just search for quays skating rink in Montreal. But it somehow gets confused when Get Directions is clicked for that place, with a starting point of Station Place-d'Armes. Instead of telling the user to head south towards the river the directions point northwards into the city!
There is no skating rink in that part of the town at all. And the Quays are quite famous landmarks in Montreal. So this was a fail on the part of Google Maps. Not a big deal actually, keeping in mind that it is easy to use Google Maps itself to get a second opinion and verify the directions.

To back up, here are the directions from the Google printout I saw:

1. Head northeast on Avenue Viger O toward Rue Saint-Urbain.
This is actually confusing. The Metro is right on that street, so Ouest or Est is not very helpful. Secondly, people in Montreal are used to calling streets going northeast-southwest as just east-west.
2. Turn left on Rue Saint Urbain.
3. Turn right on Boul Rene-Levesque O S
4. Turn left on Rue Clark - 60m.
Arrive: The Quays Skating Rink, Ville Marie, QC (NOT!)

Spam Email Counts

Is email on the way out? That is probably not yet an easy question, but the amount of spam seems to be holding steady, with periodic bursts of spam email storms.

Here are some graphs of spam at one of my mailboxes. This is for a very public email address. The spam detection is using spamassassin which runs under procmail with a customized whitelist and blacklist. Over the few years I've used this, there have been only 1-2 false positives for spam (of course, detection of false positives is not easy since this requires digging through 100s of spam messages, but I have no reason to believe that false positives are more prevalent). There have been quite a few false negatives - messages that are spam, but missed by spamassassin. These are usually around 1%-10% of the total detected spam messages, which is low enough that the graphs below are still useful to show the trend of spam message counts.

[Image: 2010 Spam Counts]
The Spam Counts images are updated periodically, usually every day, to include data of the previous complete 24-hour period.
[This image is no longer updated - the last counts will be for 2010-September. I no longer pull all email, there are only 1-3 non-spam emails and 80-100 spam messages per day. Will move all email use to one of the publicly available web sites, and have started using text chat more, and possibly move to voice chat in future too. Email is no longer very useful for home use.]

Link Filter Drupal Module

Here's yet another URL Link Filter for Drupal, latest version can be downloaded from: linkfilter-4.x-5.x-1.1.zip

Drupal 6.x version is available here: linkfilter-6.x-1.2.zip.

The goal for this filter is to be somewhat like the URL filter included with Drupal, with the additional requirement to be independent of both the Drupal installation directory and the domain, so that the URLs in Drupal nodes don't have to be re-edited when a Drupal site is moved to a different sub-directory or a different domain. Additionally, it allows link text to be specified for the URL, and it preserves the input characters as much as possible, performing no or minimal HTML entity conversions of the input characters. Finally, it distinguishes various links with classes, which can be used to display link icons for specific links. If the link filter tag points to an internal Drupal node, then a class containing the type of the node is generated, for example, class="linkfilter-drupal-node-image", which can be used to show distinguishing icons based on Drupal node type. This site uses this filter, and the link icons are displayed based on the class generated by the filter: for external links (linkfilter-urlfull class), images (linkfilter-drupal-node-acidfree or linkfilter-drupal-node-image class), mailto links (linkfilter-mailto class).

Link filter tags [l:URL text] in the input text will be replaced with a link to the given URL, which can be a Drupal link, an external web link, or a local non-Drupal link. Prefixes representing the site url and the Drupal directory are added, as appropriate:
1) Site url is prefixed if URL begins with a / character
2) No prefix is added if the URL has a : in it, as in http: or ftp: etc
3) Site url with Drupal base directory is prefixed in all other cases, this is handled by calling the Drupal l() function.
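
The three prefix rules above can be sketched roughly as follows. This is not the module's code (the module is PHP and calls Drupal's l() function); it is a minimal Python illustration, and the names resolve and link_filter, the regular expression, and the example.com URLs are all assumptions made for the example. It also assumes the rules are checked in the order listed, and it leaves out the CSS classes the real filter generates.

```python
import re

# Matches [l:URL] or [l:URL link text]; this pattern is an assumption of the sketch.
TAG = re.compile(r"\[l:([^\s\]]+)(?:\s+([^\]]*))?\]")


def resolve(url, site_url, drupal_base):
    """Apply the three prefix rules, checked in the order they are listed."""
    site = site_url.rstrip("/")
    if url.startswith("/"):
        # Rule 1: local non-Drupal link, prefix only the site URL.
        return site + url
    if ":" in url:
        # Rule 2: http:, ftp:, mailto: and so on are left untouched.
        return url
    # Rule 3: everything else is treated as a Drupal path.
    base = drupal_base.strip("/")
    prefix = site + ("/" + base if base else "")
    return prefix + "/" + url


def link_filter(text, site_url, drupal_base):
    """Replace every link filter tag in text with an HTML anchor."""
    def replace(match):
        url = match.group(1)
        label = match.group(2) or url  # fall back to the URL as link text
        return '<a href="%s">%s</a>' % (resolve(url, site_url, drupal_base), label)
    return TAG.sub(replace, text)
```

For example, with a hypothetical site at http://example.com installed under the drupal sub-directory, the tag [l:node/12 my post] would resolve to http://example.com/drupal/node/12. Moving the site only changes the two arguments, not the stored node text.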

Web provider changes umask, Gallery stops working

So, I maintain a few web sites and one site uses Gallery software. This worked fine until recently - when a user tried to create a new album, it failed with an error about being unable to create lock files [ Error: Could not open lock file (/..../public_html/albums/album01/photos.dat.lock) for writing! ]

That was a somewhat misleading error, but - in the end, this turned out to be a hosting provider issue, and not a Gallery issue.

Turns out the hosting provider, advertently or inadvertently, set the default umask for processes running under Apache to 0111. This umask removes all execute permissions from new files and directories created by scripts run under Apache.

Gallery keeps the default umask, so it inherited the 0111 umask, and when it tried to create a directory with permissions 0700, it in fact got a directory with permissions 0600 - read, write, but no execute. Of course, without execute permission, a directory is not of much use - you cannot move into that directory or create files in it, so things start erroring out from that point on. Software could be written to handle this - maybe always do a chmod after a mkdir? But that is a different discussion.
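
To make the arithmetic concrete, here is a small Python sketch (Gallery itself is PHP, so this is only an illustration, and the helper names effective_mode, mkdir_exact and demo are made up). It shows how a 0111 umask turns a requested 0700 into 0600, and how a chmod after the mkdir restores the intended permissions, since chmod is not affected by the umask. It assumes a POSIX system.

```python
import os
import shutil
import stat
import tempfile


def effective_mode(requested, umask):
    """Permissions a new file or directory actually gets: requested AND NOT umask."""
    return requested & ~umask


def mkdir_exact(path, mode=0o700):
    """Create a directory, then chmod it so the umask cannot strip any bits."""
    os.mkdir(path, mode)
    os.chmod(path, mode)  # chmod ignores the process umask


def demo(umask=0o111):
    """Return (mode from plain mkdir, mode from mkdir_exact) under the given umask."""
    parent = tempfile.mkdtemp()  # created before the umask is changed
    old_umask = os.umask(umask)
    try:
        plain = os.path.join(parent, "plain")
        fixed = os.path.join(parent, "fixed")
        os.mkdir(plain, 0o700)
        mkdir_exact(fixed, 0o700)
        return (
            stat.S_IMODE(os.stat(plain).st_mode) & 0o777,
            stat.S_IMODE(os.stat(fixed).st_mode) & 0o777,
        )
    finally:
        os.umask(old_umask)  # always restore the original umask
        shutil.rmtree(parent, ignore_errors=True)
```

Here effective_mode(0o700, 0o111) evaluates to 0o600, which matches the broken directories described above.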

It did not take too long to find this out, but getting it resolved at the hosting provider took a while - explaining umask, mkdir, and directory behavior. I guess that is the first reaction of technical support - they must get so many false reports that when a real problem comes up, it takes them some time to recognize it! [Though I am happy with the provider - they were at least engaged and responded fast with questions, and in the end, they resolved this pretty quickly.] Add to this, all morning today the Gallery site was inaccessible - so I could not search the forums for this issue. In any case, this was a new problem, not previously posted on the Web, nor mentioned at the Gallery site.

GoDaddy heavy handed in shutting down domains

nodaddy.com tells a scary story of how Go Daddy went in and disabled a site, for what seems to be totally unjustified reasons, and totally insufficient attempts made to

What is worse is that Go Daddy continues to insist they did the right thing, compounding the significance of this issue.

My domains are registered with Go Daddy, and I was hoping for a better response from them - maybe say it was a mistake and that they now have a new process in place to handle this, but it looks like that is not going to happen.

Need to think hard about whether this is a good domain registrar - though it looks like it is not that easy to find any registrar that is good in this respect, at least in the US.

Spammers are here

Like clockwork, these pests are able to locate new sites on the web, to infect with their spam postings!

For the past two years, I've not updated these pages much, so had escaped the attention of spammers. But just a month ago, when I got back to updating this site regularly, the spammers have woken up and now I've to keep a closer watch on the new posting activity here.... sigh...

Currently just using the spam module in Drupal, which is working fine - though I may need to change the configuration so that it marks a post as spam as soon as a single URL is found in it.

Yahoo still no good

I used to use Yahoo a lot, and now go there only for news and email - not for search, and will probably switch over for those old legacy uses as soon as I find something else.

In Firefox, if I go to the Yahoo search page, it would tell me:

Use Yahoo! to search from Firefox

which is something I am never going to do - for a long time now, Yahoo search results have always excluded mention of any aczoom.com link, other than somewhere way below in the search results. Given this, no way am I going to switch my Firefox search box to Yahoo - Google is much better.

This is why I presume there are so many pages on the web related to "SEO" - trying to figure out what search engines do, and how - it is too complex. For example - how come a search engine can list 100s of pages that all link-back or refer to the main page on a topic, but never have that main page show up higher on the search results?

aczoom.com uses Drupal and I suspect this issue is related to that, and URL re-writing is confusing the Yahoo bots. But there are also static pages at this site, and Yahoo does not show them either. And both Google as well as Microsoft MSN search get it right - for all Drupal hosted pages here, as well as the static pages.

So, Yahoo shall remain unused in the Firefox browsers that I use. Hopefully, this should have Yahoo engineers all concerned and rush to fix their search engine :-)!!! Until then, I'll stick with Google Search.

Older blog entry related to same topic: search engine fun

Search engine fun!

Given rare, unique words on a web page, one would expect search engines could easily determine the top sites to list for the keywords.

My interest led me to these keywords: "aczoom home page".

Google search, MSN search, Ask Jeeves search, all list my home page as the first or second item in the search results. They also list other aczoom pages in their results, and adding more keywords can find links to key pages at my site.

Here's a picture of the Google results in January 2006.

Yahoo search results are strange - they do not list a single page hosted at aczoom.com for the above search! They do list numerous pages that link to aczoom.com, but not a single direct aczoom.com page is listed.

[Well, one page is listed, but that area is supposed to be off-limits to search engines, I guess robots.txt does not work as it is supposed to work.]
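
For reference, this is roughly how a well-behaved crawler is expected to interpret robots.txt, sketched with Python's standard urllib.robotparser module. The robots.txt content, the /private/ path, the example.com URLs and the "ExampleBot" user agent are all hypothetical; the actual off-limits area on this site is not shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, allowed("ExampleBot", "http://example.com/private/page.html") is False while the home page remains fetchable. Note that robots.txt is only a request: it relies on the crawler choosing to honor it.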

Here's a picture of the Yahoo results in January 2006.

I have mostly used Google for my searches, but recently got intrigued by the issues search engines have with handling redirects, and since I use Drupal, I started checking out how search engines behaved. My conclusion is that if Yahoo can't get this simple query right, it diminishes my confidence in the credibility of their search results. This is also technically interesting - how is Yahoo building their list, that would result in this situation?

I did try to help them along, submitted aczoom.com manually to Yahoo, but that was a while ago. Shouldn't have had to do that anyway - they have so many pages that link to aczoom.com, would that not lead them to spider aczoom.com itself?
