Yunqa • The Delphi Inspiration

Delphi Components and Applications


DIUnicode

DIUnicode provides Unicode text reader and writer classes with automatic conversion from and to 144 character sets and encodings for Delphi (Embarcadero, CodeGear, Borland).

Overview

DIUnicode's native Pascal implementation covers more than 70 encodings, such as UTF-7, UTF-8, UTF-16, the ISO-8859 family, various Windows and Macintosh codepages, KOI8 character sets, Chinese GB18030, and more. When linked against DIConverters, it supports 144 character sets and encodings. Adding a new character coding is as simple as writing a single conversion procedure.
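To give a feel for how small a single-byte conversion routine can be, here is a minimal, standalone sketch that maps ISO-8859-15 bytes to Unicode code points. The function name and signature are hypothetical and chosen for illustration only; the procedure type DIUnicode actually expects is not shown on this page and may well differ.

```pascal
program Latin9Sketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

{ Hypothetical helper, NOT part of DIUnicode: maps one ISO-8859-15
  byte to its Unicode code point. }
function Latin9ToUnicode(const B: Byte): Integer;
begin
  case B of
    $A4: Result := $20AC; // EURO SIGN
    $A6: Result := $0160; // LATIN CAPITAL LETTER S WITH CARON
    $A8: Result := $0161; // LATIN SMALL LETTER S WITH CARON
    $B4: Result := $017D; // LATIN CAPITAL LETTER Z WITH CARON
    $B8: Result := $017E; // LATIN SMALL LETTER Z WITH CARON
    $BC: Result := $0152; // LATIN CAPITAL LIGATURE OE
    $BD: Result := $0153; // LATIN SMALL LIGATURE OE
    $BE: Result := $0178; // LATIN CAPITAL LETTER Y WITH DIAERESIS
  else
    // Every other byte has the same value as in ISO-8859-1,
    // which coincides with the first 256 Unicode code points.
    Result := B;
  end;
end;

begin
  WriteLn(IntToHex(Latin9ToUnicode($41), 4)); // prints 0041
  WriteLn(IntToHex(Latin9ToUnicode($A4), 4)); // prints 20AC
end.
```

Multi-byte encodings need more state than this, but the idea is the same: the encoding-specific knowledge lives in one routine, and the parsing code built on top of it stays unchanged.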

Key Benefits

DIUnicode is for you if your application needs to handle text in multiple character encodings with high performance and little development effort.

Both the Unicode Reader and the Unicode Writer work with strings, buffers, and streams. You can, for example, directly read from or write to database BLOB streams avoiding all temporary storage of your data.

An efficient buffering system guarantees excellent performance, even when processing huge files.

Simple Usage Examples

DIUnicode makes reading and writing Unicode as simple as ASCII text, regardless of the character set or encoding you are processing. The code snippets below show some of the techniques commonly used with TDIUnicodeReader, the reader class of DIUnicode. Remember that you can use the same parsing routine unchanged with any of the available encodings.

Read entire lines from a Unicode text file:

{ Setup and initialize. }
Reader := TDIUnicodeReader.Create(nil);
{ Let's say we want to read UTF-8.
  This could well be any other
  character encoding. }
Reader.ReadMethods := Read_Utf_8;
Reader.SourceStream :=
  TFileStream.Create('MyFile.txt', fmOpenRead);
{ Now the actual reading: }
while Reader.ReadLine do
  begin
    TheLine := Reader.DataAsStrW;
    { Your code to process the line
      goes here. }
  end;

Read individual characters only:

while Reader.ReadChar do
  begin
    TheChar := Reader.Char;
    case TheChar of
      'A'..'Z':
        ; // Process Alphas
      '0'..'9':
        ; // Process Digits
    end;
  end;

Use overloaded methods to read up to a particular character or a set of characters:

{ Read all characters up to the Dollar sign. }
Reader.ReadCharsTill('$');
{ Read all characters up to either '(' or ')'. }
Reader.ReadCharsTill('(', ')');
{ Skip rest of line and advance to next one. }
Reader.SkipLine;

Advanced parsing:

  • An RFC-compliant CSV parser is part of DIUnicode. Source code is available as a feature demonstration.
  • The popular DIHtmlParser is built on top of DIUnicode. It implements a full-featured HTML, XHTML and XML parser with Unicode support and a flexible plugin architecture.

Peek Ahead / Look Ahead reading

Unlike other text readers, TDIUnicodeReader does not restrict lookahead to a fixed number of characters; it is limited only by available memory. The code below reads up to five Unicode characters into the internal buffer. TDIUnicodeReader could look ahead much further, but this should not be abused: keep the number reasonably small.

var
  UR: TDIUnicodeReader;
  c: WideChar;
begin
  { ... TDIUnicodeReader creation
        and initialization should go here ... }
  UR.PeekAhead(5); // Read up to 5 characters into the internal buffer.
  if UR.PeekedCount >= 1 then // Test if the 1st peeked character could be read ...
    c := UR.PeekedChars[0]; // ... and examine it.
  if UR.PeekedCount >= 5 then // Same as above ...
    c := UR.PeekedChars[4]; // ... but with the 5th peeked character now.
  if UR.ReadChar then // Continue reading with the next char.
    c := UR.Char;
end;
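
For readers curious how memory-bound lookahead can work in principle, the following standalone sketch keeps peeked data in a dynamic array that grows on demand. TPeekingByteReader is a hypothetical illustration class, not DIUnicode code; it peeks raw bytes rather than decoded characters, and it assumes a single stream read returns everything that is available.

```pascal
program PeekBufferSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Classes;

type
  { Hypothetical illustration class; NOT part of DIUnicode. }
  TPeekingByteReader = class
  private
    FSource: TStream;
    FPeeked: array of Byte; // bytes read ahead but not yet consumed
  public
    constructor Create(ASource: TStream);
    function PeekAhead(Count: Integer): Integer;
    function ReadByte(out B: Byte): Boolean;
  end;

constructor TPeekingByteReader.Create(ASource: TStream);
begin
  inherited Create;
  FSource := ASource;
end;

{ Tries to buffer at least Count bytes; returns how many are buffered. }
function TPeekingByteReader.PeekAhead(Count: Integer): Integer;
var
  Have, Got: Integer;
begin
  Have := Length(FPeeked);
  if Count > Have then
  begin
    SetLength(FPeeked, Count); // grow: limited only by memory
    Got := FSource.Read(FPeeked[Have], Count - Have);
    SetLength(FPeeked, Have + Got); // shrink to what was really read
  end;
  Result := Length(FPeeked);
end;

{ Consumes one byte, serving it from the peek buffer first. }
function TPeekingByteReader.ReadByte(out B: Byte): Boolean;
begin
  B := 0;
  Result := PeekAhead(1) >= 1;
  if Result then
  begin
    B := FPeeked[0];
    // Drop the consumed byte from the front of the buffer.
    if Length(FPeeked) > 1 then
      Move(FPeeked[1], FPeeked[0], Length(FPeeked) - 1);
    SetLength(FPeeked, Length(FPeeked) - 1);
  end;
end;

var
  Source: TMemoryStream;
  Reader: TPeekingByteReader;
  Data: AnsiString;
  B: Byte;
begin
  Data := 'ABC';
  Source := TMemoryStream.Create;
  Reader := TPeekingByteReader.Create(Source);
  try
    Source.WriteBuffer(Data[1], Length(Data));
    Source.Position := 0;
    WriteLn('Peeked: ', Reader.PeekAhead(5)); // prints Peeked: 3
    if Reader.ReadByte(B) then
      WriteLn('First byte: ', Chr(B)); // prints First byte: A
  finally
    Reader.Free;
    Source.Free;
  end;
end.
```

Because every peeked item stays in memory until it is consumed, a large lookahead costs memory, which is why a small count is usually the better choice.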

Performance

DIUnicode is extremely fast, even when processing very large files. Both the reader and the writer classes benefit from internal buffers, which allow them to read and write files one small chunk at a time. DIUnicode never requires you to fit the entire file into memory. This way it achieves conversion rates well over 20 MB per second.
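
The chunk-at-a-time idea can be sketched with nothing but the standard Classes unit. The example below is an illustration of the general technique, not DIUnicode's internal code; the function name and the 4096-byte buffer size are arbitrary choices made for this sketch.

```pascal
program ChunkedReadSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Classes;

const
  ChunkSize = 4096; // arbitrary buffer size chosen for this sketch

{ Counts line feeds in a stream while holding at most
  ChunkSize bytes of it in memory at any time. }
function CountLineFeeds(const Source: TStream): Int64;
var
  Buffer: array[0..ChunkSize - 1] of Byte;
  BytesRead, i: Integer;
begin
  Result := 0;
  repeat
    // Read returns the number of bytes actually delivered;
    // 0 signals the end of the stream.
    BytesRead := Source.Read(Buffer, SizeOf(Buffer));
    for i := 0 to BytesRead - 1 do
      if Buffer[i] = 10 then
        Inc(Result);
  until BytesRead = 0;
end;

var
  Stream: TMemoryStream;
  Sample: AnsiString;
begin
  Sample := 'first' + #10 + 'second' + #10 + 'third' + #10;
  Stream := TMemoryStream.Create;
  try
    Stream.WriteBuffer(Sample[1], Length(Sample));
    Stream.Position := 0;
    WriteLn(CountLineFeeds(Stream)); // prints 3
  finally
    Stream.Free;
  end;
end.
```

The same loop works unchanged with a TFileStream of any size, since memory use depends on the buffer size and not on the length of the file.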

products/unicode/index.txt · Last modified: 2016/01/22 15:08 by 127.0.0.1