org.hd.d.pg2k.svrCore
Class TextUtils

java.lang.Object
  extended by org.hd.d.pg2k.svrCore.TextUtils

public final class TextUtils
extends java.lang.Object

Simple, common text utilities used throughout the application.


Nested Class Summary
private static class TextUtils.Base64Cache
          Private to encode8To6()/decode8To6(); automagically created on first access.
static interface TextUtils.CharSequence7Bit
          Extension (marker interface) for a CharSequence that holds only 7-bit character values.
static interface TextUtils.CharSequence8Bit
          Extension for a CharSequence that holds only 8-bit character values.
 
Field Summary
static java.util.Comparator<java.lang.CharSequence> CASE_INSENSITIVE_ORDER
          Orders CharSequence objects as if by String.compareToIgnoreCase(); not null.
static java.util.Comparator<java.lang.CharSequence> CASE_SENSITIVE_ORDER
          Orders CharSequence objects as if by String.compareTo(); not null.
private static int MIN_NAME_CHARS_FOR_EFFICIENT_TEXT_REPRESENTATION
          Minimum-character entity to hold as Name with its high overheads; non-negative.
private static java.lang.String[] sizeSuffixes
          Suffixes used by sizeAsText.
private static java.util.regex.Pattern wordBoundaryRegex
          Compiled regex to match inter-word gaps in plain (mainly English) text; non-null.
private static java.lang.String XML_LINE_TERM
          String to write as line terminator when converting to XML.
 
Constructor Summary
private TextUtils()
          Prevent construction of an instance.
 
Method Summary
static int compare(java.lang.CharSequence cs1, java.lang.CharSequence cs2)
          Compares two (non-null) CharSequences for lexical order.
static boolean contentEquals(java.lang.CharSequence cs1, java.lang.CharSequence cs2)
          Checks that two (non-null) CharSequences represent the same sequence of chars.
static boolean contentEquals(TextUtils.CharSequence8Bit cs1, TextUtils.CharSequence8Bit cs2)
          Checks that two (non-null) 8-bit CharSequences represent the same sequence of chars/bytes.
static boolean contentEqualsIgnoreCase(java.lang.CharSequence cs1, java.lang.CharSequence cs2)
          Checks that two (non-null) CharSequences represent the same sequence of chars, ignoring case.
static boolean contentEqualsOrBothNull(java.lang.CharSequence cs1, java.lang.CharSequence cs2)
          Checks that two CharSequences are both null or represent the same (non-null) sequence of chars.
static
<T extends java.lang.CharSequence>
java.util.SortedSet<T>
createCharSequenceSortedSet(java.util.Collection<T> csc)
          Create and populate a SortedSet of CharSequence in natural/total case-sensitive order that will work with any mix of immutable CharSequence keys; never null.
static
<T extends java.lang.CharSequence>
java.util.SortedSet<T>
createCharSequenceSortedSet(java.util.Collection<T> csc, java.util.Comparator<java.lang.CharSequence> comparator)
          Create and populate a SortedSet of CharSequence in specified order that will work with any mix of immutable CharSequence keys; never null.
static byte[] decode8To6(java.lang.String base64Text)
          Decode to a byte array (8 bit) from ASCII Base-64 (6 bit); never null.
static java.lang.String encode8To6(byte[] data8Bit)
          Encode a byte array (8 bit) in ASCII Base-64 (6 bit); never null.
static boolean endsWith(java.lang.CharSequence mainText, java.lang.CharSequence putativeSuffix)
          Returns true if the first sequence ends with the second (neither null), else false.
static java.lang.String escapeHTMLMetaChars(java.lang.String in)
          Rewrite HTML so that it displays as "raw" text and is safe to use in attribute values.
static int hashCode(java.lang.CharSequence cs)
          Return a hash code the same as or similar to that of a String containing the same characters.
static java.lang.String hashCodeHexString(java.lang.CharSequence cs)
          Return an ASCII printable hex hash code; never null nor empty.
static void importCopy(org.w3c.dom.Node dest, org.w3c.dom.Node src)
          Recursively copy second Node and contents into first node.
static int indexOf(java.lang.CharSequence cs, char c)
          First index of specified character in given (non-null) CharSequence as for String.
static int indexOf(java.lang.CharSequence cs, char c, int fromIndex)
          First index of specified character in given (non-null) CharSequence from/after specified index as for String.
static boolean isASCII7(java.lang.String s)
          Returns true if given String is 7-bit clean, ie is pure ASCII, or is null.
static int lastIndexOf(java.lang.CharSequence cs, char c)
          Last index of specified character in given (non-null) CharSequence as for String.
static int lastIndexOf(java.lang.CharSequence cs, char c, int fromIndex)
          Last index of specified character in given (non-null) CharSequence from/before specified start index as for String.
static java.lang.CharSequence makeEfficientTextRepresentation(java.lang.CharSequence value, java.util.concurrent.atomic.AtomicReference<Name> prevRef)
          Make an efficient representation for possibly non-unique text to be held in memory long-term.
static int quickWordCount(java.lang.CharSequence text)
          Quickly attempt to count the words in plain text; non-negative.
static boolean regionMatches(java.lang.CharSequence cs1, int start1, java.lang.CharSequence cs2, int start2, int len)
          Checks that the specified region of two (non-null) CharSequences matches exactly (as for String).
static java.lang.String sanitiseForXML(java.lang.String input, int maxLen, boolean allowEntities)
          Sanitise text for XHTML/HTML/WML/XML use.
static java.lang.String sizeAsText(long size, boolean abbrev)
          Renders size in bytes as text, abbreviated if requested.
static boolean startsWith(java.lang.CharSequence mainText, java.lang.CharSequence putativePrefix)
          Returns true if the first sequence starts with the second (neither null), else false.
static java.lang.String toString(byte[] ascii)
          Efficient conversion from 7-bit or 8-bit text (one char per byte) to String; never null.
static java.lang.String toXML(org.w3c.dom.Node node, boolean toXHTML, boolean terse)
          Write DOM tree as XML/XHTML String; never "" nor null.
static void toXML(java.lang.StringBuilder result, org.w3c.dom.Node node, boolean toXHTML, boolean terse)
          Write DOM tree as XML/XHTML String; appends to supplied StringBuilder.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

sizeSuffixes

private static final java.lang.String[] sizeSuffixes
Suffixes used by sizeAsText.


XML_LINE_TERM

private static final java.lang.String XML_LINE_TERM
String to write as line terminator when converting to XML. A simple LF would suit UNIX-like systems and would probably be accepted by most others. A bulkier CRLF is closer to the Internet "standard".

See Also:
Constant Field Values

CASE_SENSITIVE_ORDER

public static final java.util.Comparator<java.lang.CharSequence> CASE_SENSITIVE_ORDER
Orders CharSequence objects as if by String.compareTo(); not null. If both arguments to compare() are the same reference (even null) this returns 0 immediately.


CASE_INSENSITIVE_ORDER

public static final java.util.Comparator<java.lang.CharSequence> CASE_INSENSITIVE_ORDER
Orders CharSequence objects as if by String.compareToIgnoreCase(); not null. If both arguments to compare() are the same reference (even null) this returns 0 immediately.


MIN_NAME_CHARS_FOR_EFFICIENT_TEXT_REPRESENTATION

private static final int MIN_NAME_CHARS_FOR_EFFICIENT_TEXT_REPRESENTATION
Minimum-character entity to hold as Name with its high overheads; non-negative. If smaller than this then CS8Bit may be used instead, with internal de-duping of values and keys rather than the global intern()-driven system used by Name.

Empirically determined by distribution of sizes of keys and values!

See Also:
Constant Field Values

wordBoundaryRegex

private static final java.util.regex.Pattern wordBoundaryRegex
Compiled regex to match inter-word gaps in plain (mainly English) text; non-null.

Constructor Detail

TextUtils

private TextUtils()
Prevent construction of an instance.

Method Detail

sizeAsText

public static final java.lang.String sizeAsText(long size,
                                                boolean abbrev)
Renders size in bytes as text, abbreviated if requested. Can produce normal or abbreviated output.
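
As an illustration only, the sketch below shows one way such a renderer might work. The suffix table, the divisor of 1024, the " bytes" wording and the class name SizeAsTextSketch are all assumptions; the real sizeSuffixes values and output format are not given on this page.

```java
final class SizeAsTextSketch {
    // Hypothetical suffixes; the actual sizeSuffixes contents are not documented here.
    private static final String[] SUFFIXES = { "B", "KB", "MB", "GB", "TB", "PB", "EB" };

    /** Renders a non-negative byte count, abbreviated if requested (assumed format). */
    static String sizeAsText(long size, boolean abbrev) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (!abbrev) {
            return size + " bytes";
        }
        int idx = 0;
        long scaled = size;
        // Divide down (truncating) until the value fits under the next suffix.
        while (scaled >= 1024 && idx < SUFFIXES.length - 1) {
            scaled /= 1024;
            idx++;
        }
        return scaled + SUFFIXES[idx];
    }
}
```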


isASCII7

public static boolean isASCII7(java.lang.String s)
Returns true if given String is 7-bit clean, ie is pure ASCII, or is null.
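
A minimal sketch of the check described above, assuming "7-bit clean" means every char value is at most 0x7f. The class name Ascii7Sketch is hypothetical and this is not the actual implementation.

```java
final class Ascii7Sketch {
    /** True if s is null or contains only chars in the range 0..0x7f. */
    static boolean isASCII7(String s) {
        if (s == null) {
            return true; // null is treated as clean, per the description
        }
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7f) {
                return false;
            }
        }
        return true;
    }
}
```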


importCopy

public static final void importCopy(org.w3c.dom.Node dest,
                                    org.w3c.dom.Node src)
Recursively copy second Node and contents into first node. This works even when the second Node is from a different Document to the first and importNode() does not work (eg due to JDK1.5 impl bug).


toXML

public static final java.lang.String toXML(org.w3c.dom.Node node,
                                           boolean toXHTML,
                                           boolean terse)
Write DOM tree as XML/XHTML String; never "" nor null. Can write as terse/efficient single-line form or longer more human-friendly format.

When writing XHTML we make text nodes XML/HTML-safe, though allow embedded entities.

Parameters:
node - DOM tree root; never null
toXHTML - if true write with formatting suitable to include directly in HTML/XHTML for human consumption (eg using dl/ul/ol nested lists to produce formatted text representing the structure) rather than a pure XML representation of the Node tree itself
terse - if true write as compactly as possible, else make more human-readable if possible

toXML

public static final void toXML(java.lang.StringBuilder result,
                               org.w3c.dom.Node node,
                               boolean toXHTML,
                               boolean terse)
Write DOM tree as XML/XHTML String; appends to supplied StringBuilder. Can write as terse/efficient single-line form or longer more human-friendly format.

When writing XHTML we make text nodes XML/HTML-safe, though allow embedded entities.

Parameters:
result - the StringBuilder to which the output is appended
node - DOM tree root; never null
toXHTML - if true write with formatting suitable to include directly in HTML/XHTML for human consumption (eg using dl/ul/ol nested lists to produce formatted text representing the structure) rather than a pure XML representation of the Node tree itself
terse - if true write as compactly as possible, else make more human-readable if possible

escapeHTMLMetaChars

public static java.lang.String escapeHTMLMetaChars(java.lang.String in)
Rewrite HTML so that it displays as "raw" text and is safe to use in attribute values. Meta-characters (<, >, &, " and ') are rewritten to entity codes so that the text looks like the raw HTML when displayed in-line in HTML, and so as to avoid problems when embedded in XML and HTML.

If the input text does not contain these meta-characters then it is returned unchanged.

Additionally, all characters < 32 (ASCII control characters) are converted to spaces.

If null is passed in, then this returns null, to simplify its use in some cases where a null might be present.

Parameters:
in - String possibly containing HTML/XML meta-characters
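
The rules above can be sketched roughly as follows. This is an illustration, not the actual implementation: the class name EscapeSketch is hypothetical, and the particular entity chosen for the apostrophe (&#39;) is an assumption since the page does not say which entity codes are used.

```java
final class EscapeSketch {
    /** Escapes HTML meta-characters and maps control chars to spaces; null in, null out. */
    static String escapeHTMLMetaChars(String in) {
        if (in == null) {
            return null;
        }
        StringBuilder sb = null; // allocated lazily, only once a change is needed
        for (int i = 0; i < in.length(); i++) {
            final char c = in.charAt(i);
            final String rep;
            switch (c) {
                case '<': rep = "&lt;"; break;
                case '>': rep = "&gt;"; break;
                case '&': rep = "&amp;"; break;
                case '"': rep = "&quot;"; break;
                case '\'': rep = "&#39;"; break; // assumed entity choice
                default: rep = (c < 32) ? " " : null; // control chars become spaces
            }
            if (rep != null && sb == null) {
                // First change found: copy the untouched prefix.
                sb = new StringBuilder(in.length() + 16);
                sb.append(in, 0, i);
            }
            if (sb != null) {
                if (rep != null) { sb.append(rep); } else { sb.append(c); }
            }
        }
        // Return the original instance when nothing needed rewriting.
        return (sb == null) ? in : sb.toString();
    }
}
```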

decode8To6

public static byte[] decode8To6(java.lang.String base64Text)
Decode to a byte array (8 bit) from ASCII Base-64 (6 bit); never null. The input data must always be a multiple of 4 characters, and all characters must come from the standard Base64 set [A-Za-z0-9+/=].

Parameters:
base64Text - data in base-64, eg as encoded by encode8To6(); never null
Returns:
binary data; never null

encode8To6

public static java.lang.String encode8To6(byte[] data8Bit)
Encode a byte array (8 bit) in ASCII Base-64 (6 bit); never null. With thanks to Apache.

Parameters:
data8Bit - binary input data; never null
Returns:
ASCII base-64 representation as String; never null
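
To illustrate the encode/decode contract, the sketch below uses the JDK's java.util.Base64 (Java 8 and later) as a stand-in. It is not the implementation documented here, which has its own private cache; the class name Base64Sketch is hypothetical.

```java
import java.util.Base64;

final class Base64Sketch {
    /** Encodes bytes as padded, standard-alphabet Base64 text. */
    static String encode8To6(byte[] data8Bit) {
        return Base64.getEncoder().encodeToString(data8Bit);
    }

    /** Decodes Base64 text whose length must be a multiple of 4. */
    static byte[] decode8To6(String base64Text) {
        if (base64Text.length() % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        return Base64.getDecoder().decode(base64Text);
    }
}
```

Three input bytes map to four output characters, so padded output is always a multiple of 4 characters long, which is why the decoder can insist on that length.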

sanitiseForXML

public static java.lang.String sanitiseForXML(java.lang.String input,
                                              int maxLen,
                                              boolean allowEntities)
Sanitise text for XHTML/HTML/WML/XML use. The transformations are:

Our simple definition of a syntactically-correct entity is that it may contain any selection of ASCII digits and letters and optionally a leading hash (`#'), but certainly no whitespace. We may enforce a length limit.

If the input string is OK it is returned untouched.

TODO: Needs version with extra argument to allow entity sequences (of the form '&'xxx;) to be treated as single characters and to delete any '<'...'>' sequences, ie to do a more sophisticated job of sanitising XML/HTML that has some simple markup (primarily entity codes needed for i18n). TODO: jUnit tests

Parameters:
maxLen - if positive the output is limited to at most this number of characters; if >= 3 truncated text is marked with a trailing ``...'' in the last three positions
allowEntities - if true, treats HTML/XML entity sequences as if single characters; the entities are vaguely tested for correct syntax and may not be allowed if invalid
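
The maxLen rule can be sketched in isolation as below. This covers only plain truncation, ignoring entities and all other sanitising steps; the class and method names are hypothetical.

```java
final class TruncateSketch {
    /** Limits input to maxLen chars if maxLen is positive, marking truncation with "..." when there is room. */
    static String truncate(String input, int maxLen) {
        if (maxLen <= 0 || input.length() <= maxLen) {
            return input; // no limit requested, or already short enough
        }
        if (maxLen < 3) {
            return input.substring(0, maxLen); // no room for the marker
        }
        // The marker occupies the last three positions of the output.
        return input.substring(0, maxLen - 3) + "...";
    }
}
```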

indexOf

public static final int indexOf(java.lang.CharSequence cs,
                                char c)
First index of specified character in given (non-null) CharSequence as for String.

Returns:
index of first occurrence of c, else -1 if no occurrence of c

indexOf

public static final int indexOf(java.lang.CharSequence cs,
                                char c,
                                int fromIndex)
First index of specified character in given (non-null) CharSequence from/after specified index as for String.

Returns:
index of first occurrence of c, else -1 if no occurrence of c
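
A minimal sketch of a forward search over any CharSequence, assuming the same edge-case behaviour as String.indexOf(int, int) (a negative fromIndex is treated as 0). The class name IndexOfSketch is hypothetical.

```java
final class IndexOfSketch {
    /** First index of c at or after fromIndex, else -1. */
    static int indexOf(CharSequence cs, char c, int fromIndex) {
        final int len = cs.length();
        for (int i = Math.max(0, fromIndex); i < len; i++) {
            if (cs.charAt(i) == c) {
                return i;
            }
        }
        return -1;
    }

    /** First index of c anywhere in cs, else -1. */
    static int indexOf(CharSequence cs, char c) {
        return indexOf(cs, c, 0);
    }
}
```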

lastIndexOf

public static final int lastIndexOf(java.lang.CharSequence cs,
                                    char c)
Last index of specified character in given (non-null) CharSequence as for String.

Returns:
index of last occurrence of c, else -1 if no occurrence of c

lastIndexOf

public static final int lastIndexOf(java.lang.CharSequence cs,
                                    char c,
                                    int fromIndex)
Last index of specified character in given (non-null) CharSequence from/before specified start index as for String.

Returns:
index of last occurrence of c, else -1 if no occurrence of c
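
The backward search can be sketched the same way, assuming the edge cases of String.lastIndexOf(int, int): a fromIndex beyond the end searches the whole sequence, and a negative one finds nothing. The class name LastIndexOfSketch is hypothetical.

```java
final class LastIndexOfSketch {
    /** Last index of c at or before fromIndex, else -1. */
    static int lastIndexOf(CharSequence cs, char c, int fromIndex) {
        for (int i = Math.min(fromIndex, cs.length() - 1); i >= 0; i--) {
            if (cs.charAt(i) == c) {
                return i;
            }
        }
        return -1;
    }

    /** Last index of c anywhere in cs, else -1. */
    static int lastIndexOf(CharSequence cs, char c) {
        return lastIndexOf(cs, c, cs.length() - 1);
    }
}
```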

regionMatches

public static boolean regionMatches(java.lang.CharSequence cs1,
                                    int start1,
                                    java.lang.CharSequence cs2,
                                    int start2,
                                    int len)
Checks that the specified region of two (non-null) CharSequences matches exactly (as for String).
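
A sketch of region matching for arbitrary CharSequences, assuming it mirrors String.regionMatches(int, String, int, int): out-of-range regions yield false rather than an exception. The class name RegionMatchesSketch is hypothetical.

```java
final class RegionMatchesSketch {
    /** True if len chars from cs1 at start1 equal len chars from cs2 at start2. */
    static boolean regionMatches(CharSequence cs1, int start1,
                                 CharSequence cs2, int start2, int len) {
        if (start1 < 0 || start2 < 0) {
            return false;
        }
        // Use long arithmetic so the bounds checks cannot overflow.
        if (start1 > (long) cs1.length() - len || start2 > (long) cs2.length() - len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (cs1.charAt(start1 + i) != cs2.charAt(start2 + i)) {
                return false;
            }
        }
        return true;
    }
}
```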


contentEqualsOrBothNull

public static boolean contentEqualsOrBothNull(java.lang.CharSequence cs1,
                                              java.lang.CharSequence cs2)
Checks that two CharSequences are both null or represent the same (non-null) sequence of chars.


contentEquals

public static boolean contentEquals(TextUtils.CharSequence8Bit cs1,
                                    TextUtils.CharSequence8Bit cs2)
Checks that two (non-null) 8-bit CharSequences represent the same sequence of chars/bytes.


contentEquals

public static boolean contentEquals(java.lang.CharSequence cs1,
                                    java.lang.CharSequence cs2)
Checks that two (non-null) CharSequences represent the same sequence of chars.
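
Content comparison matters because equals() between different CharSequence implementations is generally false even when the characters match (for example, a String never equals a StringBuilder). A minimal sketch, with the hypothetical class name ContentEqualsSketch, including the null-tolerant variant:

```java
final class ContentEqualsSketch {
    /** True if both (non-null) sequences hold the same chars in the same order. */
    static boolean contentEquals(CharSequence cs1, CharSequence cs2) {
        if (cs1 == cs2) {
            return true;
        }
        final int len = cs1.length();
        if (len != cs2.length()) {
            return false; // cheap rejection before the char-by-char scan
        }
        for (int i = 0; i < len; i++) {
            if (cs1.charAt(i) != cs2.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** True if both are null, or both are non-null with equal content. */
    static boolean contentEqualsOrBothNull(CharSequence cs1, CharSequence cs2) {
        if (cs1 == null || cs2 == null) {
            return cs1 == cs2;
        }
        return contentEquals(cs1, cs2);
    }
}
```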


contentEqualsIgnoreCase

public static boolean contentEqualsIgnoreCase(java.lang.CharSequence cs1,
                                              java.lang.CharSequence cs2)
Checks that two (non-null) CharSequences represent the same sequence of chars, ignoring case.


compare

public static int compare(java.lang.CharSequence cs1,
                          java.lang.CharSequence cs2)
Compares two (non-null) CharSequences for lexical order.
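
A sketch of lexical comparison over CharSequences, assuming the same ordering as String.compareTo(): the first differing char decides, otherwise the shorter sequence sorts first. The class name CompareSketch is hypothetical.

```java
final class CompareSketch {
    /** Negative, zero or positive as cs1 sorts before, equal to, or after cs2. */
    static int compare(CharSequence cs1, CharSequence cs2) {
        if (cs1 == cs2) {
            return 0; // same reference: nothing to scan
        }
        final int common = Math.min(cs1.length(), cs2.length());
        for (int i = 0; i < common; i++) {
            final int diff = cs1.charAt(i) - cs2.charAt(i);
            if (diff != 0) {
                return diff;
            }
        }
        return cs1.length() - cs2.length();
    }
}
```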


startsWith

public static boolean startsWith(java.lang.CharSequence mainText,
                                 java.lang.CharSequence putativePrefix)
Returns true if the first sequence starts with the second (neither null), else false.


endsWith

public static boolean endsWith(java.lang.CharSequence mainText,
                               java.lang.CharSequence putativeSuffix)
Returns true if the first sequence ends with the second (neither null), else false.
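
Prefix and suffix tests over mixed CharSequence types can share a single helper, as in this sketch. The class name AffixSketch and the helper matchesAt are hypothetical.

```java
final class AffixSketch {
    /** True if mainText begins with putativePrefix. */
    static boolean startsWith(CharSequence mainText, CharSequence putativePrefix) {
        return matchesAt(mainText, 0, putativePrefix);
    }

    /** True if mainText ends with putativeSuffix. */
    static boolean endsWith(CharSequence mainText, CharSequence putativeSuffix) {
        return matchesAt(mainText, mainText.length() - putativeSuffix.length(), putativeSuffix);
    }

    // True if part occurs in text starting exactly at offset.
    private static boolean matchesAt(CharSequence text, int offset, CharSequence part) {
        if (offset < 0 || part.length() > text.length() - offset) {
            return false;
        }
        for (int i = 0; i < part.length(); i++) {
            if (text.charAt(offset + i) != part.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```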


createCharSequenceSortedSet

public static <T extends java.lang.CharSequence> java.util.SortedSet<T> createCharSequenceSortedSet(java.util.Collection<T> csc)
Create and populate a SortedSet of CharSequence in natural/total case-sensitive order that will work with any mix of immutable CharSequence keys; never null. This may not be consistent with equals() if keys are of mixed types.

The result is not thread-safe and is mutable.

Parameters:
csc - initial collection to populate from; can be null to avoid initial population

createCharSequenceSortedSet

public static <T extends java.lang.CharSequence> java.util.SortedSet<T> createCharSequenceSortedSet(java.util.Collection<T> csc,
                                                                                                    java.util.Comparator<java.lang.CharSequence> comparator)
Create and populate a SortedSet of CharSequence in specified order that will work with any mix of immutable CharSequence keys; never null. This may not be consistent with equals() if keys are of mixed types.

The result is not thread-safe and is mutable.

Parameters:
csc - initial collection to populate from; can be null to avoid initial population
comparator - used to order the items; never null
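
One plausible way to build such a set is a TreeSet driven by a content-based comparator, as sketched below. The class name SortedSetSketch, the method name create and the LEXICAL comparator are hypothetical, and the real method may use a different backing collection.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.SortedSet;
import java.util.TreeSet;

final class SortedSetSketch {
    /** Case-sensitive content ordering that works across CharSequence types. */
    static final Comparator<CharSequence> LEXICAL = (a, b) -> {
        if (a == b) {
            return 0;
        }
        final int common = Math.min(a.length(), b.length());
        for (int i = 0; i < common; i++) {
            final int diff = a.charAt(i) - b.charAt(i);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length() - b.length();
    };

    /** Creates a mutable, non-thread-safe sorted set, optionally populated from csc. */
    static <T extends CharSequence> SortedSet<T> create(Collection<T> csc,
                                                        Comparator<CharSequence> comparator) {
        final SortedSet<T> result = new TreeSet<T>(comparator);
        if (csc != null) {
            result.addAll(csc);
        }
        return result;
    }
}
```

Because membership is decided by the comparator, a String and another CharSequence with the same content count as duplicates even though equals() between them is false; that is the inconsistency with equals() noted above.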

makeEfficientTextRepresentation

public static java.lang.CharSequence makeEfficientTextRepresentation(java.lang.CharSequence value,
                                                                     java.util.concurrent.atomic.AtomicReference<Name> prevRef)
Make an efficient representation for possibly non-unique text to be held in memory long-term. Does not make reference to other existing values that may be held together.

A null input results in a null.

A non-8-bit (or empty) text results in a String result.

A short (8-bit) text will result in a non-intern()ed CS8Bit result.

Else an (implicitly intern()ed) Name is returned.

This aims to avoid intern() overhead for small values.

Parameters:
value - the value to optimise
prevRef - if non-null, used to convey Name values from one call to the next to help provide a more-compact representation by sharing prefixes and suffixes
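
The decision order described above can be sketched as a classifier. Since Name and CS8Bit are not documented on this page, the sketch only reports which representation would be chosen; the enum Kind, the method choose, and passing the threshold as a parameter are all assumptions, and "8-bit" is taken to mean every char is at most 0xff.

```java
final class RepresentationChoiceSketch {
    enum Kind { NULL, STRING, CS8BIT, NAME }

    /** Picks a representation; minNameChars stands in for the private threshold constant. */
    static Kind choose(CharSequence value, int minNameChars) {
        if (value == null) {
            return Kind.NULL;
        }
        if (value.length() == 0 || !is8Bit(value)) {
            return Kind.STRING; // empty or non-8-bit text stays a String
        }
        if (value.length() < minNameChars) {
            return Kind.CS8BIT; // short text: avoid intern() overhead
        }
        return Kind.NAME;
    }

    /** True if every char fits in 8 bits. */
    static boolean is8Bit(CharSequence cs) {
        for (int i = 0; i < cs.length(); i++) {
            if (cs.charAt(i) > 0xff) {
                return false;
            }
        }
        return true;
    }
}
```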

toString

public static java.lang.String toString(byte[] ascii)
Efficient conversion from 7-bit or 8-bit text (one char per byte) to String; never null. The array passed is not altered in any way.

Parameters:
ascii - text; never null but may be empty
Returns:
new String instance; never null
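
One standard way to get a one-char-per-byte conversion is to decode as ISO-8859-1, which maps each byte value 0..255 to the char with the same value. The sketch below assumes that mapping is what is wanted; the class name BytesToStringSketch is hypothetical and the real method may be implemented differently.

```java
import java.nio.charset.StandardCharsets;

final class BytesToStringSketch {
    /** Converts each byte to the char with the same unsigned value; input is not modified. */
    static String toString(byte[] ascii) {
        return new String(ascii, StandardCharsets.ISO_8859_1);
    }
}
```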

hashCode

public static int hashCode(java.lang.CharSequence cs)
Return a hash code the same as or similar to that of a String containing the same characters.
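
For the "same as" case, the hash documented for String.hashCode() (s[0]*31^(n-1) + ... + s[n-1]) can be computed over any CharSequence without creating a String. A sketch, with the hypothetical class name HashSketch:

```java
final class HashSketch {
    /** Computes the String.hashCode() formula over the chars of cs. */
    static int hashCode(CharSequence cs) {
        int h = 0;
        for (int i = 0; i < cs.length(); i++) {
            h = 31 * h + cs.charAt(i); // int overflow wraps, as in String
        }
        return h;
    }
}
```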


hashCodeHexString

public static java.lang.String hashCodeHexString(java.lang.CharSequence cs)
Return an ASCII printable hex hash code; never null nor empty.


quickWordCount

public static int quickWordCount(java.lang.CharSequence text)
Quickly attempt to count the words in plain text; non-negative.
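
A rough sketch of counting words by splitting on inter-word gaps. The gap pattern used here (runs of whitespace) is an assumption; the actual wordBoundaryRegex is private and not shown on this page. The class name WordCountSketch is hypothetical.

```java
import java.util.regex.Pattern;

final class WordCountSketch {
    // Assumed gap pattern: one or more whitespace chars.
    private static final Pattern GAP = Pattern.compile("\\s+");

    /** Approximate, non-negative word count for plain text. */
    static int quickWordCount(CharSequence text) {
        final String trimmed = text.toString().trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return GAP.split(trimmed).length;
    }
}
```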


DHD Multimedia Gallery V1.60.69

Copyright (c) 1996-2012, Damon Hart-Davis. All rights reserved.