SEO, PHP, Web search

Issuing 404 response from PHP scripts

2008-12-06 | Comment?

If some page is not there anymore on your web server, tell that to the the visitor and to the search engine bot. They will both benefit from the message you have sent to them.

Users will see that the page doesn’t exists anymore. It is a good practice to design your 404 pages to look like the rest of your site. Clearly notify the visitor that the page isn’t there anymore and offer him various options to continue exploring your site’s content. Site search box is a nice touch on custom 404 pages.

Search engines bots, including Googlebot, will see that the page doesn’t exist anymore if they receive 404 Http response from the server. Googlebot will then remove your page from the search engine index, so yes, the number of your pages in Google index will decrease, but on the other hand, you will make the room for other pages from your site that contain more valuable information.

Issuing 404 errors

It is easy to return 404 with static pages. If you remove test.html from your web server, the next time when you navigate to domain.com/test.html you will receive 404 not found error. If you have configured Apache to display custom error messages, specified 404 page will be displayed.

But what if you publish the content from your database by using single script? Lets say that you are publishing news articles by using /article.php?id=1234 where 1234 is article ID in database. While the article is in the database everything will be fine, but what if you remove the article? How your script article.php handles that situation?

As I seen on the sites worldwide, some sites just publish blank page with header and footer, obviously hiding internal errors, but not displaying any human explanation what happened. Some badly configured web servers even print the internal SQL error that the article is not found, which is a disaster both for site security and user’s experience.

The best practice of course is to detect the issue and print the notification message for humans that the article doesn’t exist anymore. But you should also inform the Googlebot that the article has been deleted. You can do that by sending 404 header to the bot by using PHP header() function.

header("HTTP/1.0 404 Not Found"); // message for bots
echo "Article not found"; // message for humans
exit();

Even better, if you have custom 404 page for Apache somewhere, you can include that page directly from PHP script so all 404 pages on your web site will have the same appearance.

header("HTTP/1.0 404 Not Found");
include("/var/www/errors/404.html");
exit();

Frontend, Web search

Www or no www?

2008-07-16 | 1 Comment

Do you redirect your domain.com to www.domain.com or backward www.domain.com to domain.com? If not, you should. You should pick one URL version, www or no-www, and use 301 to redirect another URL version to the one you have picked up. Here is why:

  • Some search engines might consider domain.com and www.domain.com as two different domains so your link juice might split instead of focusing on one single domain. Although I believe Google now figures out these situations, it is still a good practice to redirect to one single URL version.
  • It might speed up your pages load time. For example, when you visit www.domain.com your browser will cache all images, styles and javascripts so the next time when you navigate to the site, it will show up much quicker. But if you visit domain.com it is a whole new site for your browser and the browser will download images, styles and javascripts all over again. With redirections to single URL version you are saving bandwidth and reduce site load time if user accidentally or through some external link visit the other URL version of your site.
  • Same goes for cookies and logging. If you are logged on www.domain.com and then navigate your browser to domain.com, it will be a completely new site for your browser and you will be forced to log in again.

So, www or no www?

We have seen that it is a good practice to choose one URL version and redirect other on it, but which one to pick? Wwww or no-www version? Basically www is deprecated and there are even organizations that promote www removal like no-www.org.

But, if you are aiming mainstream users, I believe that www is still strongly memorized im peoples minds as a synonym for the web and most of them will type www in their location bars. If you use browser shortcuts you also know that Ctrl+Enter in location bar will add `www.` prefix and `.com` sufix on the name you have entered. And I still believe that most of the people will use www when linking your site from their blogs or webpages. So, I believe that www is still a better option for most of the sites.

However, if you are using some fancy web 2.0 domain name like del.icio.us or your service is targeted to tech guys and geeks, like this blog, no-www version might be a good choice as well.

Don’t forget that from Google webmasters tools you can choose which version of your domain name will show up in Google SERPs.

Technical stuff

If you are using Apache web server, you can redirect all urls using mod_rewrite by adding these lines in your .htaccess file:

For www to no-www:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^(.*)$ http://%1/$1 [R=301,L,NE]

For no-www to www:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(yourdomain\.com) [NC]
RewriteRule ^(.*) http://www.%1/$1 [r=301,L,NE]

This wil redirect all your web pages, for example domain.com/category/page will be redirected to www.domain.com/category/page.

Sphinx, MySQL

Sphinx is live and kicking

2008-02-24 | Comment?

Sphinx logoAs I have mentioned in my previous post, I have started using Sphinx as a search engine on my site. It was a very exciting week for my site with a record breaking number of visitors. Sphinx has handled it great. I must say that MySQL full text search have done a great job too. There was some slow queries before, but after adding 1GB of RAM memory couple months ago, I really haven’t experienced any major problems with MySQL full text search. But I have realized that I would have to switch to some better search engine sooner or later.

Sphinx has replaced MySQL full text search successfully. Search is very fast and the CPU usage is lower even with the increasing number of visits. Not only that I don’t use MySQL fulltext for searching, but the removing full text indexes surely helped the CPU, since it doesn’t have to constantly update fulltext indexes on each update. Instead, I have implemented Sphinx main+delta scheme for re-indexing. Main index is re-indexed nightly while the delta index is rebuilt every 15 minutes.

So far, so good…

Solr, Sphinx, Lucene, PHP, API, MySQL

Sphinx and Lucene search engines, first impressions

2008-02-15 | 2 Comments

MySQL full text search is the easiest way to provide full text functionality on your site. It is easy to implement and easy to use. You just have to specify MySQL FULLTEXT index, and after that you search the tables with appropriate SQL queries.

However, there are some disadvantages. On a large databases full text search might perform very slow. Also, MySQL full text indexes are available only on MyISAM tables, so if you are using InnoDB table engine to store critical data, you won’t be able to use MySQL full text search.

In order to scale the search of large data sets you will have to look for the solution beyond MySQL. Here I will write about my first thoughts on two popular open source search engines, Lucene and Sphinx.

I will not talk about performances and speed. This post presents basic things about installation, configuration and implementation of each search engine, so it might help you to choose between Lucene and Sphinx if you want to switch from MySQL full text search on something better, quickly and easily.

MySQL Lucene Sphinx search comparationDon’t think about performances at this moment, search should be faster than MySQL full text search. I will show you only one slide from this presentation which you can find on MySQL Performance Blog to demonstrate the speed difference between MySQL, Lucene and Sphinx on a 5GB test database (of course, results may vary depending on your datasets). If you are interested in alternative full text search engines, you should take a look on the entire presentation (38 slides, PDF)

Lucene

Lucene logoLucene is actually a Java library, so basically you don’t have to install anything to use Lucene. It is a very powerful tool for indexing and searching documents. Lucene doesn’t care do you index files, database records or something else. It is your duty to prepare data and send it to Lucene index as Document object. That means that you will need some extra work to index your database records, or whatever data you want to search. Once created, index can be easily updated or expanded by adding/updating documents from your scripts and programs.

Lots of popular heavy traffic sites are powered by Lucene search like Wikipedia or Digg, so the performance and scalability is not under the question. Lucene is also rich with search options providing boolean search, phrase search, wildcard search, field search and other features.

Implementation around Lucene might be a little bit problematic if you are not using Java. If your project is in PHP, like mine, you can use something like PHP-Java bridge. There is also a PHP Lucene class in Zend framework. I haven’t tried the class, but I have read here that indexing and searching performances are very poor comparing to the original Lucene Java class. It might not be the truth, but you should have that in your mind.

Solr is perhaps the best way to use Lucene search on any platform. Solr is also a Java application and you have to install it on some servlet container, like Tomcat. Once installed, you can access Solr from any development platform using REST protocol. You send add/update XML requests to the update URL (for example http://localhost:8080/solr/update) to index documents.

<add>
<doc>
<field name="id">05991</field>
<field name="description">This is a blog for webmasters and developers</field>
</doc>
</add>

On search, you send GET queries to the Solr select URL (for example http://localhost:8080/solr/select?q=test) and you will receive XML response from the Solr containing structured information with search results. Cool, right? Here you can find the list of very popular web sites that use Solr for search.

Sphinx

Sphinx logoSphinx is a full text search engine and it is ready-made to be used with open source databases like MySQL (MyISAM, InnoDB) and PostgreSql. It is very easy for installation. Configuration is simple as well. Basically, you will specify the SQL connection parameters and the SQL query to fetch the data from your database, and the Sphinx indexer will create the index for your database.

Indexing is very fast, but the problem is that there is no way to update index partially, for example to update/add only couple of records. Instead, you have to re-index an entire dataset from the beginning. However, there is a proposed solution that can be implemented automatically. You create a main index and delta index with recently added/updated records. You regularly re-index only delta index, while you re-index main only periodically. As far as I can see, this works very good in practice once you get it right, at least for my 1GB table, especially because, as I already sad, indexing is very fast. In my case, 5000+ documents delta index is re-indexed in a split second.

All major programming languages are supported, including PHP, Python, Java, Ruby and C++. Sphinx supports features like boolean search, phrase search, field search… It doesn’t support wildcards search, but I’m not crazy about that feature.

My pick

I have tried Lucene with Solr. It needs more coding to index data and an extra work to install and configure Tomcat and Solr. Although everything worked fine on my desktop, I had trouble installing Tomcat on the production server so I have given up Lucene and Solr for now.

On the other hand, Sphinx is very easy to install and configure. If you are switching from MySQL full text search, you only need to adjust your scripts to use sphinx API in combination with SQL instead of MySQL full text search query. I still haven’t launched the Sphinx search on the production server, but the test system is working fine. Basically, Sphinx has all the features I need, I only hope that they will add some live updates and/or partial re-indexing soon. :(

It is up to you. If you are not using Java for your project and you don’t want to spend extra time installing and configuring php-java-bridge or Tomcat/Solr, Sphinx might be the better and instant solution, especially if you are migrating from MySQL full text search. However, if your data sets are frequently updated, you will have to seriously consider can you organize partial indexing with Sphinx main+delta indexing scheme.

Lucene however is still known as the most advanced open source search engine with lots of search options, live updates and other things, so if you need all those functionalities, it might be better to go straight to Lucene and Solr.

Amazon Web Services, Facebook, API

Building Facebook applications on Amazon Web Services (AWS)

2008-02-05 | Comment?

Recently I have written about my first impressions on Facebook platform. Amazon Web Services today announced that they are teaming up with Facebook developers team to provide easy way to build instantly scalable Facebook applications on Amazon Web Services, S3 Simple Storage Service and EC2 Elastic Compute Cloud. According to Amazon, you can use their web services to build scalable Facebook applications which will ensure reliability, flexibility, and cost-effectiveness while the popularity of your Facebook applications grows.

I haven’t checked this in details, but if you are familiar with AWS and Facebook platform, you can take a look on this EC2 hosting tutorial, and hello world example. If you have some time, send your comments about your impressions on AWS and Facebook platform…

Domains

Domain tasting company have bought the domain in the front of your nose, how to prevent and react

2008-01-27 | Comment?

It is a good timing for a post about domain tasting since it was recently reported that Google is considering to stop serving AdSense ads on tasted domains. This is a good news, but Google AdSense is not the only revenue source for the domain tasting companies which means that you can still become a victim (yes, a victim!) of domain tasting.

How does it works

I have read about domain tasting for the first time on Bob Parson’s blog (GoDaddy CEO) and you can also check Wikipedia entry for more details. In brief:

  • Domain tasting companies abuse the 5 days grace period when purchasing domains with some registrants.
  • They buy domains and put ads on it.
  • After 5 days they use 5 days money back clause to cancel the domain.
  • Even if they had only one ad click in that period that earnd them $0.10 it is a pure profit since domain cost them $0 in 5-days period.
  • Multiply this with tens of thousands domains registered per day and you have a very profitable business (to be more accurate, according to Bob, 32.000.000 domains have been tasted back in May 2006! 32M in one single moth!?!).
  • Sometimes, they might hit the “jack pot”. For example, domain have earned whole $1 in 5 days. Domain tasters estimates that it will gain profit from that domain over the year so they usually keep those domains.

And they have registered the domain in the front of your nose?

How do the tasting companies decide which domains to purchase. Sometimes randomly perhaps or by trying different words combinations from the English dictionary. They also buy expired domains by default. But lots of people experienced that domain was registered in the front of their nose. One day you check the domain name and it is available. You come back tomorrow to buy it and the domain is sold?!?! And that doesn’t happen by accident, since the people who were searching for specific non-English word domains also experienced the problem. More likely is that you have some spyware installed on your computer.

Not so likely, you have some really evil spyware that track your searches on GoDaddy or any other domain registrants.

More likely
, some extension that you are using is selling your anonymously data to the domain tasters. If you visit some domain from your browser and the domain doesn’t exist, browser extension can detect that and send the data back to the server. Domain tasters would kill for that information, unregistered domain name that someone has accidentally visited is what they are looking for. Various companies buys anonymusly statistic data directly from extension developers, so watch it! Check the privacy policies of the extensions you are using! Even some useful browser extension might collect and sell anonymously traffic data to other companies. Usually, that kind of things won’t heart you, but in this case it might cost you to lose your desirable domain name.

How to prevent

Now when we do know how they are finding all those domain names to buy, it is easier to prevent it.

  • It might be a good idea to search for available domains by using whois from command line. :)
  • Check privacy policies of the extensions you are using.
  • If you have lots of browser extensions installed, use some other browser for domain searching. For example, if you are using Firefox, install Opera only for this purpose.
  • If you find a good domain name, buy it immediately even if you are not sure is that what you are looking for! Domains are cheap today. If you decide to pick another domain name for your site, you can easily sell that domain later on a domain after market and possibly even earn a few dollars.

And how to react

But if it is already to late, and domain tasters have taken the domain in the front of your nose, don’t panic. The good thing is that you are now aware of domain tasting business model and how does it works.

  • Check the whois for the domain and see who have registered the domain and when.
  • Search for more info on the company that have registered your domain. Do a little Google research and you will find in which context the company is mentioned on forums and blogs. For example, when the friend of mine experienced the domain tasting problem, we have searched for the company and found this and similar posts. In that moment we knew that domain tasting company have bought the domain.
  • If you realize that domain tasting company have bought your domain, there is a great chance that domain will be available again in 5 days. But, you have to be COOL!
  • Don’t visit the domain in the next 5 days and even if you do by accident, don’t click on anything. Don’t generate any traffic or profit to them since domain tasting company might decide to keep your domain.
  • Don’t tell anyone! Don’t write to your friend over IM that someone grab the xyz.com in the front of your nose since your friend will probably click on the link to see the homepage. And of course, don’t blog or twitter about it as well. :) Just sit and wait.
  • If you stay cool and patient, your domain will probably become available again in 5 days so go register it as quickly as possible! :)

Smarty, PHP

Using Smarty PHP template engine

2008-01-24 | Comment?

I believe that everyone have started to use PHP by adding some dynamic content to static HTML pages. Usually you would just add <?php ?> inside your HTML code. But over the time, if you add more and more PHP code, database support, etc. your scripts might become really messy, which is not good. Also, over the time PHP evolved from Hypertext Preprocessor in serious programming language which caused a different approach to PHP sites/applications. It is a bad practice these days to mix your PHP and HTML, logic and presentation. Instead you should try to distinguish your PHP and logic from the presentation (HTML and PHP).

You can use PHP for that as well. Separate your logic and presenation in two groups of PHP scripts. You would do all calculation and data preparation in the first group of scripts and then you will present the data with other group of scripts which would usually contain HTML code with the addition of dynamically generated content by printing it with <?php echo $my_var;?>. But if you have ever edited Word Press theme, you will see that HTML code becomes unreadable with the heavy usage of the <?php ?> elements inside.

That is where Smarty comes. You can write very clean HTML templates by using Smarty variables and Smarty commands. Smarty will also make the intermediate PHP code from your template that looks alike Word Press themes I have just mentioned, but you don’t have to worry about that. All you have to do is to make the pure HTML templates. Intermediate PHP code will be cached so your performances won’t be decreased.

Smarty is very simple tool. You create the object, assign values to Smarty variables and on the end of the script, you will call the display method to publish the page.

$smarty->assign('title', 'Hello, World');
$smarty->assign('message', 'some text');
$smarty->display('welcome.tpl');

Why I really like Smarty is that you can easily present the data in many different formats. For example, I use the same code base to retrieve data form the database and prepare objects and then use the different smarty templates to publish content for the different platforms.

Using Smarty

Lets take a look on the diagram above. Various scripts use my base classes to retrieve data from the databases. Once formated and prepared for publishing, by using different Smarty templates, the same data sets can be used to produce various content like web page, rss feed, content for mobile devices or even FBML for Facebook application.

PHP, eAccelerator

eAccelerator

2008-01-15 | Comment?

Problem with script languages is that the scripts have to be parsed and compiled each time when the script is executed. Each time when you visit some PHP page, your server must parse and compile the script before executing. Can you imagine how much CPU time would be saved if those scripts are parsed and compiled only once, and then executed multiple times? Well, you don’t have to imagine because softwares, like eAccelerator, do exactly that.

eAccelerator is a free open-source PHP accelerator, optimizer, and dynamic content cache. It increases the performance of PHP scripts by caching them in their compiled state, so that the overhead of compiling is almost completely eliminated. It also optimizes scripts to speed up their execution. eAccelerator typically reduces server load and increases the speed of your PHP code by 1-10 times.

Installation

If you have the dedicated server, and you seek the way to improve your server performances, I would highly recommended you to install the eAccellerator. Although I’m not very experienced Linux admin, I had no problems with the installation by following the installation from source guide. Before compilation check all compile parameters if you want to enable additional functions like eaccelerator_put and eaccelerator_get (I will write about using shared memory functions in some other post). If you are running Windows machine at home, here you can find eAccelerator Windows binaries.

Configuration

You can find details about all eAccelerator configuration variables here. Probably the most important variable is the eaccelerator.shm_size which represents the size of the memory dedicated to the eAccelerator. How much memory you have to dedicate to the eAccelerator? Well, with the eAccelerator you will also get the control.php that you should copy on your web server. By navigating your browser to the control.php you will see the eAccelerator memory usage and other things on the web page. Set some starting shm_size value and monitor this page first few days and then see should you increase or decrease memory size for eAccelerator. From this control panel you can also easily clear the eAccelerator cache from the memory.

eAccelerator Control Panel

If eAccelerator is unable to cache the script in memory, it will by default use file system for the caching by saving cached content inside the directory specified with eaccelerator.cache_dir. If you want to use only memory for the caching, set eaccelerator.shm_only to 1. Also check the eaccelerator.shm_ttl and eaccelerator.shm_prune_period. This variables tells eAccelerator which old scripts to remove from memory if there is not enough shared memory to cache a new script.

Facebook, API

Facebook platform

2008-01-08 | 4 Comments

facebook platform logoRecently I have played a little bit with the Facebook platform and Facebook application development. My aim was not to become Facebook expert, I just wanted to see what they have to offer and how quickly I can assemble a simple Facebook application.

Facebook platform contains several key components:

Interface (API) is the set of methods which you can use to access user data, friends lists, groups, photos etc. There are also data methods in beta which allows you to use Facebook as storage for your applications. Basically, you access API through REST interface, but you should download and use existing Facebook client libraries. Client libraries exists for all popular web programming languages. So instead of building REST request, sending it to the Facebook server and receiving and parsing the results, you will just call the predefined method from the Facebook library (php example: $facebook->api_client->friends_list) and you will get the results in array or some other data type, depending of method you have just called.

FBML is the Facebook extension of the XHTML which allows you to put standard Facebook components on your application canvas.

FBJS is the Facebook Javascript. Although regular Javascript will usually work, Facebook developed some extended Javascript objects and methods which help your application to be executed in its own name space among other Facebook apps.

FQL is Facebook query language. As far as I have seen, you can obtain similar data with FQL like you can with the API, but in this case you use SQL syntax to retrieve friends list and other stuff.

Getting started

After you check the anatomy of Facebook applications it is the best practice to start your own hello world application by following the step by step guide. Very soon you will have your own Hello World application and after you have learned basic stuff it is up to you to upgrade your application and build anything you want on top of it.

In brief, your application is hosted on your web site and hosting. You put your application in, for example http://yourdomain.com/myfacebookapp/ and that will be your „callback URL“. Any access to the http://apps.facebook.com/yourappname/somepage.php will cause Facebook to fetch the page from http://yourdomain.com/myfacebookapp/somepage.php and display the script’s output inside your application canvas. Of course, Facebook will not just display the output as-is. It will first transform your FMBL in regular XHTML by replacing FMBL tags with HTML elements with appropriate css class names. Application will be put in its own name space, for example your <div id=’header’> will become <div id=’app688273534_header’>. Similar thing will happen with Javascript (FBJS) and styles that reference that element using it’s id. For example #header {color: #c0c0c0;} will become #app688273534_header {color: #c0c0c0;} Oh, and by the way, you can’t link your external css files, you have to embed styles with <style> tag inside your FMBL page.

First impressions

Building Facebook apps is not a rocket science. Basically you just build web pages as usual but you use some additional FMBL elements. At least that was the idea. Unfortunately, you will not find FMBL elements for some common Facebook components that you can see in Facebook apps, so you will be forced to create your own divs and spans and assign Facebook css classes or even create your own css classes in Facebook style to get Facebook look and feel. On the other hand, some existing FMBL elements are not customizable enough, and again, you will have to create your own HTML to imitate look & feel of Facebook apps.

For example, this code will produce the common header menu for your application.

<fb:dashboard>
<fb:action href="new.php">Create a new photo album</fb:action>
<fb:action href="you.php">Photos of You</fb:action>
</fb:dashboard>

facebook dashboard

As you can see, it will also include your application name. Lets say that you don’t want your application name to be displayed here, you want to put your content just below the menu or you want to put your logo instead of name, you can’t do that because <fb:dashboard> will by default print your application name and you only have the property to turn application icon on or off on the left side of your application name. Anyway, you will have to make your own header menu without using <fb:dashboard> to avoid application name appearance.

I hope that Facebook will make lots of this fb elements more customizable and that they will offer even more elements, some of them are already in betas. Since I’m not the only one that thinks that the list of fb elements should be expanded, there is a place where you can suggest and request new fbml elements.

Web search

Keep the visitors on your site and reduce the bounce rate

2007-12-18 | Comment?

My news aggregation site receives a lot of traffic from Google, which is great of course. Most of the hits from Google comes from the long tail for various keywords and usually don’t hit the homepage or some of the main categories but some secondary pages. Unfortunately, lots of users from those hits don’t hang up on the site long enough. They find the article they want and proceed to the other site to read the article.

Although the user base of my regular visitors are growing, I wanted to tease even more new visitors to explore the site little bit more before they leave it. I have introduced a few tweaks on my secondary pages that shows up to the visitors referred by the Google search.

New window

The first thing I have tried is to put target=“_blank“ on all external links if the visitor is referred to the site from Google Search. Usually I let external links open in the same window, but if the user is referred from Google Search, I don’t want just to push the user on the other site and let him forget about my website. The user will at least see your site once more, when he returns to close the window/tab, but if you offer quality content and service, it might also explore your site.

Site introduction, link to homepage

I have offered visible color box with welcome message explaining my site’s niche. So perhaps the visitors just hit the page searching for some gossip, but I want to let him know that I offer great news aggregation on my home page and that he should visit my homepage to see it. I’m not afraid to make these box clearly visible and colored since it will be visible only to visitors that are coming from Google search, not to my regular visitors.

Using Google keywords to provide search on your own site

The last, and most efficient method I have used is to use keywords that visitor have searched for on Google and offer the visitor a search on my site for the same keywords. All this information are contained in the referrer url, you just have to extract them. So if the users searched for paris hilton and ended up on some page of your site, you can offer him to search all of your content for the keywords paris hilton.

If you have configured some AdWords campaign, you know that Google suggest you to use the keywords you have targeted in the title of your ad. Same principle works here, when the visitor sees the keywords he just typed, there is much more possibility that the visitor will click on the link. If you have any kind of search on your site/blog you can implement the same method.

Conclusion and results

Visits on my site are growing constantly and because of this I can’t measure the exact effects of these methods. Although the bounce rate is still the same or 1-2% reduced, the number of visitors in total is growing so I’m satisfied with the performance.

However, I do count the clicks on links that I have described in sections two and tree. These links record over 300 clicks daily. That might be 300 new potential regular users every day, and that is great.

« Previous Entries