com.norconex.commons.lang.url
Class URLNormalizer

java.lang.Object
  extended by com.norconex.commons.lang.url.URLNormalizer
All Implemented Interfaces:
Serializable

public class URLNormalizer
extends Object
implements Serializable

The general idea behind URL normalization is to make different URLs "equivalent" (i.e. eliminate URL variations pointing to the same resource). To achieve this, URLNormalizer takes a URL and modifies it to its most basic or standard form (for the context in which it is used). Of course, URLNormalizer can also be used simply as a generic URL manipulation tool.

You would typically "build" your normalized URL by invoking each method of interest, in the relevant order, using an approach similar to this:

 String url = "Http://Example.com:80//foo/index.html";
 URL normalizedURL = new URLNormalizer(url)
         .lowerCaseSchemeHost()
         .removeDefaultPort()
         .removeDuplicateSlashes()
         .removeDirectoryIndex()
         .addWWW()
         .toURL();
 System.out.println(normalizedURL.toString());
 // Output: http://www.example.com/foo/

Several of the normalization methods implemented come from the RFC 3986 standard. That standard and several other normalization techniques are well summarized in the Wikipedia article titled URL Normalization. This class implements most of the normalizations described in that article, plus a few additional ones, and borrows several of the article's examples.

The normalization methods available can be broken down into three categories:

Preserving Semantics

The following normalizations are part of the RFC 3986 standard and should result in equivalent URLs (ones that identify the same resource):

Usually Preserving Semantics

The following techniques will generate a semantically equivalent URL for the majority of use cases but are not enforced as a standard.

Not Preserving Semantics

These normalizations will fail to produce semantically equivalent URLs in many cases. They usually work best when you have a good understanding of the website behind the supplied URL and know which normalizations can be considered to produce semantically equivalent URLs for that site.

Refer to each method below for a description and examples.

Author:
Pascal Essiembre
See Also:
Serialized Form

Constructor Summary
URLNormalizer(String url)
          Create a new URLNormalizer instance.
URLNormalizer(URL url)
          Create a new URLNormalizer instance.
 
Method Summary
 URLNormalizer addTrailingSlash()
          Adds a trailing slash (/) to a URL ending with a directory.
 URLNormalizer addWWW()
          Adds "www." domain name prefix.
 URLNormalizer decodeUnreservedCharacters()
          Decodes percent-encoded unreserved characters.
 URLNormalizer lowerCaseSchemeHost()
          Converts the scheme and host to lower case.
 URLNormalizer removeDefaultPort()
          Removes the default port (80 for http, and 443 for https).
 URLNormalizer removeDirectoryIndex()
          Removes directory index files.
 URLNormalizer removeDotSegments()
          Removes the unnecessary "." and ".." segments from the URL path.
 URLNormalizer removeDuplicateSlashes()
          Removes duplicate slashes.
 URLNormalizer removeEmptyParameters()
          Removes empty parameters.
 URLNormalizer removeFragment()
          Removes the URL fragment (from the "#" character until the end).
 URLNormalizer removeSessionIds()
          Removes a URL-based session id.
 URLNormalizer removeTrailingQuestionMark()
          Removes trailing question mark ("?").
 URLNormalizer removeWWW()
          Removes "www." domain name prefix.
 URLNormalizer replaceIPWithDomainName()
          Replaces IP address with domain name.
 URLNormalizer secureScheme()
          Converts http scheme to https.
 URLNormalizer sortQueryParameters()
          Sorts query parameters.
 String toString()
          Returns the normalized URL as string.
 URI toURI()
          Returns the normalized URL as URI.
 URL toURL()
          Returns the normalized URL as URL.
 URLNormalizer unsecureScheme()
          Converts https scheme to http.
 URLNormalizer upperCaseEscapeSequence()
          Converts letters in URL-encoded escape sequences to upper case.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

URLNormalizer

public URLNormalizer(URL url)
Create a new URLNormalizer instance.

Parameters:
url - the url to normalize

URLNormalizer

public URLNormalizer(String url)
Create a new URLNormalizer instance.

Parameters:
url - the url to normalize
Method Detail

lowerCaseSchemeHost

public URLNormalizer lowerCaseSchemeHost()
Converts the scheme and host to lower case.

HTTP://www.Example.com/ → http://www.example.com/

Returns:
this instance
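
As a rough illustration of what this normalization involves, the sketch below lower-cases only the scheme and the host of a URL string and leaves the path, query and fragment untouched. It is a minimal stand-alone sketch using the JDK only; the class name LowerCaseSchemeHostSketch and its regular expression are assumptions made for this example, not the URLNormalizer source.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the URLNormalizer implementation.
final class LowerCaseSchemeHostSketch {

    // group 1: scheme, group 2: authority, group 3: path, query and fragment
    private static final Pattern PARTS =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]*)(.*)$");

    static String lowerCaseSchemeHost(String url) {
        Matcher m = PARTS.matcher(url);
        if (!m.matches()) {
            return url; // no "scheme://authority" prefix: leave untouched
        }
        String authority = m.group(2);
        // Keep any "user:password@" prefix as-is, since user info is case-sensitive.
        int at = authority.lastIndexOf('@');
        String userInfo = authority.substring(0, at + 1);
        String hostAndPort = authority.substring(at + 1).toLowerCase(Locale.ROOT);
        return m.group(1).toLowerCase(Locale.ROOT) + "://"
                + userInfo + hostAndPort + m.group(3);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/
        System.out.println(lowerCaseSchemeHost("HTTP://www.Example.com/"));
    }
}
```

The path is deliberately left alone because paths are generally case-sensitive.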

upperCaseEscapeSequence

public URLNormalizer upperCaseEscapeSequence()
Converts letters in URL-encoded escape sequences to upper case.

http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b

Returns:
this instance
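
One way to picture this normalization is a regular-expression pass over every "%XX" triplet. The following is a minimal sketch under that assumption; the class name UpperCaseEscapeSketch is hypothetical and the code is not taken from URLNormalizer.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the URLNormalizer implementation.
final class UpperCaseEscapeSketch {

    // A percent sign followed by exactly two hexadecimal digits.
    private static final Pattern ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    static String upperCaseEscapeSequence(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Only the matched triplet is upper-cased; surrounding text is copied as-is.
            m.appendReplacement(sb, m.group().toUpperCase(Locale.ROOT));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/a%C2%B1b
        System.out.println(upperCaseEscapeSequence("http://www.example.com/a%c2%b1b"));
    }
}
```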

decodeUnreservedCharacters

public URLNormalizer decodeUnreservedCharacters()
Decodes percent-encoded unreserved characters.

http://www.example.com/%7Eusername/ → http://www.example.com/~username/

Returns:
this instance
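
RFC 3986 defines the unreserved characters as letters, digits, "-", ".", "_" and "~". The sketch below decodes a percent-encoded triplet only when it stands for one of those characters and leaves every other escape (such as %2F) in place. It is a stand-alone illustration; the class and method names are assumptions, not the URLNormalizer source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the URLNormalizer implementation.
final class DecodeUnreservedSketch {

    private static final Pattern ESCAPE = Pattern.compile("%([0-9a-fA-F]{2})");

    // RFC 3986, section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    static String decodeUnreservedCharacters(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // Reserved or non-ASCII escapes are copied back unchanged.
            String replacement = isUnreserved(decoded) ? String.valueOf(decoded) : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/~username/
        System.out.println(decodeUnreservedCharacters("http://www.example.com/%7Eusername/"));
    }
}
```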

removeDefaultPort

public URLNormalizer removeDefaultPort()
Removes the default port (80 for http, and 443 for https).

http://www.example.com:80/bar.html → http://www.example.com/bar.html

Returns:
this instance
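
The rule can be sketched with two regular expressions, one per scheme, so that ":80" is dropped only for http and ":443" only for https. This is a minimal sketch with hypothetical names (RemoveDefaultPortSketch and its patterns); it is not the URLNormalizer source.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; not the URLNormalizer implementation.
final class RemoveDefaultPortSketch {

    // The lookahead requires the port to end the authority, so ":8080" is not touched.
    private static final Pattern HTTP_80 = Pattern.compile(
            "^(http://[^/?#]*):80(?=[/?#]|$)", Pattern.CASE_INSENSITIVE);
    private static final Pattern HTTPS_443 = Pattern.compile(
            "^(https://[^/?#]*):443(?=[/?#]|$)", Pattern.CASE_INSENSITIVE);

    static String removeDefaultPort(String url) {
        String result = HTTP_80.matcher(url).replaceFirst("$1");
        return HTTPS_443.matcher(result).replaceFirst("$1");
    }

    public static void main(String[] args) {
        // prints http://www.example.com/bar.html
        System.out.println(removeDefaultPort("http://www.example.com:80/bar.html"));
    }
}
```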

addTrailingSlash

public URLNormalizer addTrailingSlash()

Adds a trailing slash (/) to a URL ending with a directory. A URL is considered to end with a directory if the last path segment, before fragment (#) or query string (?), does not contain a dot, typically representing an extension.

Please Note: URLs do not always denote a directory structure, and many URLs can qualify for this method without truly representing a directory. Adding a trailing slash to these URLs could potentially break their semantic equivalence.

http://www.example.com/alice → http://www.example.com/alice/

Returns:
this instance
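
The "no dot in the last path segment" heuristic described above can be sketched as follows. The handling of URLs without any path (left unchanged here) is an assumption of this sketch, as are the class and method names; none of it is taken from the URLNormalizer source.

```java
// Illustrative sketch only; not the URLNormalizer implementation.
final class AddTrailingSlashSketch {

    static String addTrailingSlash(String url) {
        // Split off the query string or fragment, whichever comes first.
        int cut = url.length();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                cut = i;
                break;
            }
        }
        String head = url.substring(0, cut);
        String tail = url.substring(cut);

        // Locate the first slash after "scheme://"; that is where the path starts.
        int schemeEnd = head.indexOf("://");
        int pathStart = schemeEnd < 0 ? -1 : head.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            return url; // ASSUMPTION: URLs without a path are left alone
        }
        String lastSegment = head.substring(head.lastIndexOf('/') + 1);
        if (lastSegment.isEmpty() || lastSegment.contains(".")) {
            return url; // already ends with "/" or looks like a file name
        }
        return head + "/" + tail;
    }

    public static void main(String[] args) {
        // prints http://www.example.com/alice/
        System.out.println(addTrailingSlash("http://www.example.com/alice"));
    }
}
```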

removeDotSegments

public URLNormalizer removeDotSegments()

Removes the unnecessary "." and ".." segments from the URL path. URI.normalize() is invoked to perform this normalization. Refer to it for exact behavior.

http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html

Please Note: URLs do not always represent a clean hierarchical structure, and the dots/double-dots may have a different meaning on some sites. Removing them from a URL could potentially break its semantic equivalence.

Returns:
this instance
See Also:
URI.normalize()
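
Since the description above defers to URI.normalize(), a direct call on the JDK class shows the kind of result to expect. The wrapper class below is a hypothetical helper written for this example, and its input has no leading ".." segment, so it only exercises the plain "." and "segment/.." cases.

```java
import java.net.URI;

// Illustrative sketch only; it calls the JDK's URI.normalize() directly.
final class RemoveDotSegmentsSketch {

    static String removeDotSegments(String url) {
        // URI.create throws IllegalArgumentException on a malformed URL.
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/a/c/d.html
        System.out.println(removeDotSegments("http://www.example.com/a/b/../c/./d.html"));
    }
}
```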

removeDirectoryIndex

public URLNormalizer removeDirectoryIndex()

Removes directory index files. They are often not needed in URLs.

http://www.example.com/a/index.html → http://www.example.com/a/

Index files must be the last URL path segment to be considered. The following are considered index files:

Please Note: There are no guarantees a URL without its index files will be semantically equivalent, or even be valid.

Returns:
this instance
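
The "last path segment" rule can be sketched as below. The index file names in the sketch are illustrative assumptions, not the list URLNormalizer actually uses, and keeping any query string or fragment is likewise a choice made for this example only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; not the URLNormalizer implementation.
final class RemoveDirectoryIndexSketch {

    // ASSUMPTION: example names only, not the list URLNormalizer uses.
    private static final Set<String> INDEX_FILES =
            new HashSet<>(Arrays.asList("index.html", "index.htm", "index.php"));

    static String removeDirectoryIndex(String url) {
        // Split off the query string or fragment, whichever comes first.
        int cut = url.length();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                cut = i;
                break;
            }
        }
        String head = url.substring(0, cut);
        int schemeEnd = head.indexOf("://");
        int pathStart = schemeEnd < 0
                ? head.indexOf('/') : head.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            return url; // no path, so nothing to remove
        }
        int lastSlash = head.lastIndexOf('/');
        // Only the last path segment is compared against the index names.
        if (INDEX_FILES.contains(head.substring(lastSlash + 1))) {
            return head.substring(0, lastSlash + 1) + url.substring(cut);
        }
        return url;
    }

    public static void main(String[] args) {
        // prints http://www.example.com/a/
        System.out.println(removeDirectoryIndex("http://www.example.com/a/index.html"));
    }
}
```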

removeFragment

public URLNormalizer removeFragment()

Removes the URL fragment (from the "#" character until the end).

http://www.example.com/bar.html#section1 → http://www.example.com/bar.html

Returns:
this instance

replaceIPWithDomainName

public URLNormalizer replaceIPWithDomainName()

Replaces IP address with domain name. This is often not reliable due to virtual domain names and can be slow, as it has to access the network.

http://208.77.188.166/ → http://www.example.com/

Returns:
this instance

unsecureScheme

public URLNormalizer unsecureScheme()

Converts https scheme to http.

https://www.example.com/ → http://www.example.com/

Returns:
this instance

secureScheme

public URLNormalizer secureScheme()

Converts http scheme to https.

http://www.example.com/ → https://www.example.com/

Returns:
this instance

removeDuplicateSlashes

public URLNormalizer removeDuplicateSlashes()

Removes duplicate slashes. Two or more adjacent slash ("/") characters will be converted into one.

http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html

Returns:
this instance
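
A naive global replacement of "//" would also damage the "://" that follows the scheme, so any implementation has to skip it. The sketch below collapses slashes only between the scheme separator and the start of the query string or fragment. Restricting the change to that portion is an assumption of this example, and the class name is hypothetical; the code is not the URLNormalizer source.

```java
// Illustrative sketch only; not the URLNormalizer implementation.
final class RemoveDuplicateSlashesSketch {

    static String removeDuplicateSlashes(String url) {
        // Find where the query string or fragment starts, whichever comes first.
        int cut = url.length();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                cut = i;
                break;
            }
        }
        // Skip over "scheme://" so its double slash is preserved.
        int schemeEnd = url.indexOf("://");
        int start = (schemeEnd >= 0 && schemeEnd < cut) ? schemeEnd + 3 : 0;

        String prefix = url.substring(0, start);
        String middle = url.substring(start, cut).replaceAll("/{2,}", "/");
        // ASSUMPTION: the query string and fragment are left untouched.
        return prefix + middle + url.substring(cut);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/foo/bar.html
        System.out.println(removeDuplicateSlashes("http://www.example.com/foo//bar.html"));
    }
}
```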

removeWWW

public URLNormalizer removeWWW()

Removes "www." domain name prefix.

http://www.example.com/ → http://example.com/

Returns:
this instance

addWWW

public URLNormalizer addWWW()

Adds "www." domain name prefix.

http://example.com/ → http://www.example.com/

Returns:
this instance
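
The two "www." methods are mirror images of each other, which a pair of small regular-expression helpers can illustrate. This is a minimal sketch that assumes the host starts right after "scheme://" (no user info); the class name WwwPrefixSketch is hypothetical and the code is not the URLNormalizer source.

```java
// Illustrative sketch only; not the URLNormalizer implementation.
final class WwwPrefixSketch {

    // Inserts "www." after "scheme://" unless the host already starts with it.
    static String addWWW(String url) {
        return url.replaceFirst("(?i)^([a-z][a-z0-9+.-]*://)(?!www\\.)", "$1www.");
    }

    // Drops a leading "www." from the host, if present.
    static String removeWWW(String url) {
        return url.replaceFirst("(?i)^([a-z][a-z0-9+.-]*://)www\\.", "$1");
    }

    public static void main(String[] args) {
        // prints http://www.example.com/
        System.out.println(addWWW("http://example.com/"));
        // prints http://example.com/
        System.out.println(removeWWW("http://www.example.com/"));
    }
}
```

Because addWWW checks for an existing prefix, applying it twice gives the same result as applying it once.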

sortQueryParameters

public URLNormalizer sortQueryParameters()

Sorts query parameters.

http://www.example.com/?z=bb&y=cc&z=aa → http://www.example.com/?y=cc&z=bb&z=aa

Returns:
this instance
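
In the example above, the two "z" parameters keep their original relative order (z=bb before z=aa). That result is consistent with a stable sort on the parameter name alone, which is what the sketch below does. Sorting by name only is an inference from the example, not a statement about the URLNormalizer source, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only; not the URLNormalizer implementation.
final class SortQueryParametersSketch {

    static String sortQueryParameters(String url) {
        // Set the fragment aside so a "?" inside it is never mistaken for a query.
        int hash = url.indexOf('#');
        String fragment = hash < 0 ? "" : url.substring(hash);
        String main = hash < 0 ? url : url.substring(0, hash);

        int q = main.indexOf('?');
        if (q < 0 || q == main.length() - 1) {
            return url; // no query string, or an empty one
        }
        String[] params = main.substring(q + 1).split("&");
        // Arrays.sort on objects is stable: parameters with equal names keep their order.
        Arrays.sort(params, Comparator.comparing(SortQueryParametersSketch::name));
        return main.substring(0, q + 1) + String.join("&", params) + fragment;
    }

    // The parameter name is everything before the first "=".
    private static String name(String param) {
        int eq = param.indexOf('=');
        return eq < 0 ? param : param.substring(0, eq);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/?y=cc&z=bb&z=aa
        System.out.println(sortQueryParameters("http://www.example.com/?z=bb&y=cc&z=aa"));
    }
}
```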

removeEmptyParameters

public URLNormalizer removeEmptyParameters()

Removes empty parameters.

http://www.example.com/display?a=b&a=&c=d&e=&f=g → http://www.example.com/display?a=b&c=d&f=g

Returns:
this instance
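
A possible reading of "empty parameter" is a name followed by "=" and nothing else, as in "a=" and "e=" above. The sketch below filters on that reading. Keeping parameters that have no "=" at all, and keeping the "?" when every parameter is removed, are assumptions of this example; the code is not the URLNormalizer source.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the URLNormalizer implementation.
final class RemoveEmptyParametersSketch {

    static String removeEmptyParameters(String url) {
        // Set the fragment aside so only the query string is filtered.
        int hash = url.indexOf('#');
        String fragment = hash < 0 ? "" : url.substring(hash);
        String main = hash < 0 ? url : url.substring(0, hash);

        int q = main.indexOf('?');
        if (q < 0) {
            return url; // no query string
        }
        List<String> kept = new ArrayList<>();
        for (String param : main.substring(q + 1).split("&")) {
            int eq = param.indexOf('=');
            // ASSUMPTION: drop "name=" and empty chunks; keep "name" without "=".
            boolean empty = param.isEmpty() || eq == param.length() - 1;
            if (!empty) {
                kept.add(param);
            }
        }
        return main.substring(0, q + 1) + String.join("&", kept) + fragment;
    }

    public static void main(String[] args) {
        // prints http://www.example.com/display?a=b&c=d&f=g
        System.out.println(removeEmptyParameters(
                "http://www.example.com/display?a=b&a=&c=d&e=&f=g"));
    }
}
```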

removeTrailingQuestionMark

public URLNormalizer removeTrailingQuestionMark()

Removes trailing question mark ("?").

http://www.example.com/display? → http://www.example.com/display

Returns:
this instance

removeSessionIds

public URLNormalizer removeSessionIds()

Removes a URL-based session id. It removes PHP (PHPSESSID), ASP (ASPSESSIONID), and Java EE (jsessionid) session ids.

http://www.example.com/servlet;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED?a=b → http://www.example.com/servlet?a=b

Please Note: Removing session IDs from a URL often causes that URL to return an error when it is invoked.

Returns:
this instance
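
For the Java EE case shown above, the session id is carried as a ";jsessionid=..." path parameter, which a single regular expression can strip. The sketch below covers that case only (PHPSESSID and ASPSESSIONID are left out); its class name and pattern are assumptions for illustration, not the URLNormalizer source.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; handles the jsessionid case, not PHP or ASP ids.
final class RemoveJSessionIdSketch {

    // ";jsessionid=" and its value, up to the next "?", "#", ";" or "/".
    private static final Pattern JSESSIONID = Pattern.compile(
            ";jsessionid=[^?#;/]*", Pattern.CASE_INSENSITIVE);

    static String removeJSessionId(String url) {
        return JSESSIONID.matcher(url).replaceFirst("");
    }

    public static void main(String[] args) {
        // prints http://www.example.com/servlet?a=b
        System.out.println(removeJSessionId(
                "http://www.example.com/servlet;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED?a=b"));
    }
}
```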

toString

public String toString()
Returns the normalized URL as string.

Overrides:
toString in class Object
Returns:
URL

toURI

public URI toURI()
Returns the normalized URL as URI.

Returns:
URI

toURL

public URL toURL()
Returns the normalized URL as URL.

Returns:
URL


Copyright © 2008-2013 Norconex Inc. All Rights Reserved.