Hackers Libray: Text Handling

Humans write information down as "text," composed of words, figures, and punctuation; the words are constructed using a combination of uppercase and lowercase letters, depending on their grammatical use. Consequently, processing text using a computer is a difficult, yet commonly required task. The ANSI C definitions include string-processing functions that are, by their nature, case-sensitive; that is, the capital letter A is regarded as distinct from the lowercase letter a. This is the first problem that must be overcome by the programmer. Fortunately, both Borland's Turbo C compilers and Microsoft's C compilers include case-insensitive forms of the string functions.

For example, stricmp( ) is the case-insensitive form of strcmp( ) and strnicmp( ) is the case-insensitive form of strncmp( ). If you are concerned about writing portable code, then you must restrict yourself to the ANSI C functions, and write your own case-insensitive functions using the tools provided.
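As a rough sketch of what writing your own case-insensitive function might look like, here is a comparison built only from ANSI C's tolower( ). The name my_stricmp is made up for this illustration and is not part of any library:

```c
#include <ctype.h>

/* Hypothetical portable stand-in for the non-ANSI stricmp().
   Returns <0, 0 or >0, like strcmp(), but ignores letter case. */
int my_stricmp(const char *a, const char *b)
{
    /* Walk both strings while the case-folded characters agree */
    while (*a && tolower((unsigned char)*a) == tolower((unsigned char)*b))
    {
        a++;
        b++;
    }
    /* Difference of the first mismatching pair (or of the terminators) */
    return tolower((unsigned char)*a) - tolower((unsigned char)*b);
}
```

A call such as my_stricmp("Hello", "hELLO") should report the two strings as equal. The casts to unsigned char keep tolower( ) well defined for characters outside the 7-bit ASCII range.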

Here is a simple implementation of a case-insensitive version of strstr( ). The function simply makes a copy of the parameter strings, converts those copies to uppercase, then does a standard strstr( ) on the copies. The offset of the target string within the source string will be the same for the copy as the original, and so it can be returned relative to the parameter string:

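char *stristr(char *s1, char *s2)
{
    /* Both strings are assumed to be shorter than 1000 characters */
    char c1[1000];
    char c2[1000];
    char *p;

    /* Work on uppercase copies (strupr() is a Borland/Microsoft extension) */
    strcpy(c1,s1);
    strcpy(c2,s2);
    strupr(c1);
    strupr(c2);

    p = strstr(c1,c2);
    if (p)
        return s1 + (p - c1);
    return NULL;
}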
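If portability matters, the same search can be sketched without strupr( ) or the fixed-size copies by comparing characters in place. The name stristr_portable is an assumption made for this example only:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical ANSI-only variant of stristr(): compares in place, so it
   needs neither strupr() nor fixed-size copy buffers. */
char *stristr_portable(const char *s1, const char *s2)
{
    const char *start;
    const char *a;
    const char *b;

    for (start = s1; ; start++)
    {
        a = start;
        b = s2;
        /* Match as much of s2 as possible from this starting point */
        while (*b && toupper((unsigned char)*a) == toupper((unsigned char)*b))
        {
            a++;
            b++;
        }
        if (*b == '\0')
            return (char *)start;   /* whole of s2 matched here */
        if (*start == '\0')
            return NULL;            /* ran out of source string */
    }
}
```

This trades the two copies for a nested loop, which is slower on long strings but removes the 1000-character limit.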

This function scans a string, s1, looking for the word held in s2. The word must be a complete word, not simply a character pattern, for the function to return TRUE. It makes use of the stristr( ) function described previously:

int word_in(char *s1,char *s2)
{
    /* Return non-zero if s2 occurs as a word in s1 */
    /* Needs ctype.h for isalpha() and string.h for strlen() */
    char *p;
    char *q;
    int ok;

    if (*s2 == 0)
        return 0;

    ok = 0;
    q = s1;
    do
    {
        /* Locate character occurrence of s2 in s1 */
        p = stristr(q,s2);
        if (p)
        {
            /* Found */
            ok = 1;

            /* Check previous character */
            if (p > s1 && isalpha((unsigned char)*(p - 1)))
                ok = 0;

            /* Check character following the match */
            if (isalpha((unsigned char)p[strlen(s2)]))
                ok = 0;

            /* Resume the search just past the start of this match */
            q = p + 1;
        }
    }
    while(p && !ok);
    return ok;
}

More useful functions for dealing with text are the following: truncstr( ), which truncates a string:

void truncstr(char *p,int num)
{
    /* Truncate string by losing last num characters */
    if (num >= 0 && (size_t)num < strlen(p))
        p[strlen(p) - num] = 0;
}

trim( ), which removes trailing spaces from the end of a string:

void trim(char *text)
{
    /* Remove trailing spaces */
    char *p;

    /* Start at the terminator and check bounds before looking at a character */
    p = text + strlen(text);
    while(p > text && *(p - 1) == ' ')
        *--p = 0;
}

strlench( ), which changes the length of a string by adding or deleting characters:

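void strlench(char *p,int num)
{
    /* Change length of string by adding or deleting characters */
    /* When adding, the buffer must have room for num more characters */
    if (num > 0)
        memmove(p + num,p,strlen(p) + 1);
    else
    {
        /* When deleting, num must not exceed the length of the string */
        num = 0 - num;
        memmove(p,p + num,strlen(p + num) + 1);
    }
}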

strins( ), which inserts a string into another string:

void strins(char *p,char *q)
{
    /* Insert string q into p */
    /* Open a gap at p, then copy q into it without its terminator */
    strlench(p,strlen(q));
    strncpy(p,q,strlen(q));
}

and strchg( ), which replaces all occurrences of one substring with another within a target string:

void strchg(char *data, char *s1, char *s2)
{
    /* Replace all occurrences of s1 with s2 */
    char *p;
    char changed;

    p = data;
    do
    {
        changed = 0;
        p = strstr(p,s1);
        if (p)
        {
            /* Delete original string */
            strlench(p,0 - (int)strlen(s1));

            /* Insert replacement string */
            strins(p,s2);

            /* Skip the replacement, so an s2 that contains s1 cannot loop forever */
            p += strlen(s2);
            changed = 1;
        }
    }
    while (changed);
}
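Note that strchg( ) trusts the caller to supply a buffer big enough for the result. A minimal sketch of a bounded variant, which takes the buffer size and gives up rather than overflow, might look like this. The name replace_all and its return convention are assumptions for the illustration:

```c
#include <string.h>

/* Hypothetical bounded variant of strchg(): replaces every occurrence of
   from with to inside buf, which holds at most size bytes including the
   terminator. Returns the number of replacements made, or -1 as soon as
   a replacement would not fit. */
int replace_all(char *buf, size_t size, const char *from, const char *to)
{
    size_t flen = strlen(from);
    size_t tlen = strlen(to);
    char *p = buf;
    int count = 0;

    if (flen == 0)
        return 0;                           /* nothing sensible to search for */

    while ((p = strstr(p, from)) != NULL)
    {
        size_t tail = strlen(p + flen) + 1; /* rest of string plus '\0' */

        if ((size_t)(p - buf) + tlen + tail > size)
            return -1;                      /* replacement would overflow buf */

        memmove(p + tlen, p + flen, tail);  /* open or close the gap */
        memcpy(p, to, tlen);                /* drop in the replacement */
        p += tlen;                          /* continue after it */
        count++;
    }
    return count;
}
```

On a -1 return, any replacements already made stay in place, so a caller that needs all-or-nothing behavior would have to work on a copy.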

Time

C provides the time( ) function to read the computer's system clock and return the system time as a number of seconds since midnight January 1, 1970. This value can be converted to a useful string with the function ctime( ), as illustrated:

#include <stdio.h>
#include <time.h>

int main()
{
    /* Variable to hold time; time_t is defined in time.h */
    time_t t;

    /* Get system date and time from computer */
    t = time(NULL);

    printf("Today's date and time: %s\n",ctime(&t));
    return 0;
}

The string returned by ctime( ) is composed of seven fields:

Day of the week

Month of the year

Date of the day of the month

Hour

Minutes

Seconds

Year (four digits, including the century)

These are terminated by a newline character and null-terminating byte. Since the fields always occupy the same width, slicing operations can be carried out on the string with ease. The following program defines a structure, time, and a function, gettime( ), which extracts the hours, minutes, and seconds of the current time, and places them in the structure:

#include <stdio.h>
#include <string.h>
#include <time.h>

struct time
{
    int ti_min;  /* Minutes */
    int ti_hour; /* Hours */
    int ti_sec;  /* Seconds */
};

void gettime(struct time *now)
{
    time_t t;
    char temp[26];
    char *ts;

    /* Get system date and time from computer */
    t = time(NULL);

    /* Translate date and time into a string */
    strcpy(temp,ctime(&t));

    /* Copy out just time part of string (characters 11 to 18) */
    temp[19] = 0;
    ts = &temp[11];

    /* Scan time string and copy into time structure */
    sscanf(ts,"%2d:%2d:%2d",&now->ti_hour,&now->ti_min,&now->ti_sec);
}

int main()
{
    struct time now;

    gettime(&now);
    printf("\nThe time is %02d:%02d:%02d",now.ti_hour,now.ti_min,now.ti_sec);
    return 0;
}

The ANSI standard on C does provide a function to convert the value returned by time( ) into a structure, as shown in the following snippet. Also note the structure 'tm' is defined in time.h:

#include <stdio.h>
#include <time.h>

int main()
{
    time_t t;
    struct tm *tb;

    /* Get time into t */
    t = time(NULL);

    /* Convert time value t into structure pointed to by tb */
    tb = localtime(&t);

    printf("\nTime is %02d:%02d:%02d",tb->tm_hour,tb->tm_min,tb->tm_sec);
    return 0;
}

/* struct tm, as declared in time.h */
struct tm
{
    int tm_sec;
    int tm_min;
    int tm_hour;
    int tm_mday;
    int tm_mon;
    int tm_year;
    int tm_wday;
    int tm_yday;
    int tm_isdst;
};
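Once the time is in a struct tm, ANSI C's strftime( ) can format it directly, which avoids slicing the ctime( ) string by hand. The sketch below shows one way this might be wrapped; the helper name format_clock is invented for the example:

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: writes the time of day from *tb into buf as
   hh:mm:ss. Returns the number of characters written (8), or 0 if
   buf, which is n bytes long, is too small. */
size_t format_clock(const struct tm *tb, char *buf, size_t n)
{
    /* %H, %M and %S are the two-digit hour, minute and second fields */
    return strftime(buf, n, "%H:%M:%S", tb);
}
```

Typical use would be to pass the pointer returned by localtime( ) together with a buffer of at least nine bytes.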

Timers

Often a program must determine the date and time from the host computer's nonvolatile RAM. Several time functions are provided by the ANSI standard on C that enable a program to retrieve the current date and time. First, time( ) returns the number of seconds that have elapsed since midnight on January 1, 1970. It has the prototype:

time_t time(time_t *timer);

Here, time( ) stores the result in the time_t variable that its parameter points to, and returns the same value. You can call time( ) with a NULL parameter and collect the return value, as in:

#include <time.h>

int main()
{
    time_t now;

    now = time(NULL);
    return 0;
}

Next, asctime( ) converts a time block to a 26-character string of the form:

Sun Sep 16 01:03:52 1973\n\0

The asctime( ) function has the prototype:

char *asctime(const struct tm *tblock);

Next, ctime( ) converts a time value (as returned by time( )) into a 26-character string of the same format as asctime( ). For example:

#include <string.h>
#include <time.h>

int main()
{
    time_t now;
    char date[30];

    now = time(NULL);
    strcpy(date,ctime(&now));
    return 0;
}

Another time function, difftime( ), returns the difference, in seconds, between two values (as returned by time( )). This can be useful for testing the elapsed time between two events, the time a function takes to execute, and for creating consistent delays that do not depend on the speed of the host computer. An example delay program would be:

#include <stdio.h>
#include <time.h>

void DELAY(int period)
{
    time_t start;

    start = time(NULL);

    /* Busy-wait until period seconds have passed */
    while(time(NULL) < start + period)
        ;
}

int main()
{
    printf("\nStarting delay now... (please wait 5 seconds)");
    DELAY(5);
    puts("\nOkay, I've finished!");
    return 0;
}
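To time how long a function takes, difftime( ) can be wrapped around a call, roughly as follows. The names time_call and sample_work are hypothetical, and because time( ) usually ticks once per second, this only makes sense for work that runs for several seconds:

```c
#include <time.h>

/* Hypothetical helper: runs fn once and returns the elapsed wall-clock
   time in seconds, as measured by time() and difftime(). */
double time_call(void (*fn)(void))
{
    time_t start;
    time_t finish;

    start = time(NULL);
    fn();
    finish = time(NULL);
    return difftime(finish, start);     /* finish minus start, in seconds */
}

/* Example workload: a busy loop standing in for real work */
void sample_work(void)
{
    volatile long i;

    for (i = 0; i < 1000000L; i++)
        ;
}
```

Calling time_call(sample_work) returns the number of whole seconds that elapsed, which will often be zero for a loop this short.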

The gmtime( ) function converts a time value (as returned by time( )) to Greenwich Mean Time (GMT), and stores it in a time block. With some compilers this function depends upon the global variable timezone being set. The time block is a predefined structure (declared in time.h) as follows:

struct tm
{
    int tm_sec;
    int tm_min;
    int tm_hour;
    int tm_mday;
    int tm_mon;
    int tm_year;
    int tm_wday;
    int tm_yday;
    int tm_isdst;
};

Here, tm_mday records the day of the month, ranging from 1 to 31; tm_mon is the month, ranging from 0 (January) to 11; tm_wday is the day of the week, with Sunday being represented by 0; tm_year is the number of years since 1900; tm_isdst is a flag to show whether daylight saving time is in effect. Compilers may add further members to the structure, but these members are required by the ANSI standard.

The mktime( ) function converts a time block to a calendar time value of type time_t, filling in the tm_wday and tm_yday members of the block as it does so. It follows the prototype:

time_t mktime(struct tm *t);

The following example allows entry of a date, and uses mktime( ) to calculate the day of the week appropriate to that date. Only dates from January 1, 1970 to the present are recognizable by the time functions:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main()
{
    struct tm tsruct;
    int okay;
    char data[100];
    char *p;
    char *wday[] = { "Sunday", "Monday", "Tuesday", "Wednesday",
                     "Thursday", "Friday", "Saturday",
                     "prior to 1970, thus not known" };

    do
    {
        okay = 0;
        printf("\nEnter a date as dd/mm/yy ");

        /* Read the whole line; a limit of 8 would cut off the year */
        if (fgets(data,sizeof(data),stdin) == NULL)
            return 1;

        p = strtok(data,"/");
        if (p != NULL)
            tsruct.tm_mday = atoi(p);
        else
            continue;

        p = strtok(NULL,"/");
        if (p != NULL)
            tsruct.tm_mon = atoi(p) - 1;    /* tm_mon runs from 0 to 11 */
        else
            continue;

        p = strtok(NULL,"/");
        if (p != NULL)
            tsruct.tm_year = atoi(p);       /* years since 1900 */
        else
            continue;

        okay = 1;
    }
    while(!okay);

    tsruct.tm_hour = 0;
    tsruct.tm_min = 0;
    tsruct.tm_sec = 1;
    tsruct.tm_isdst = -1;

    /* Now get day of the week */
    if (mktime(&tsruct) == (time_t)-1)
        tsruct.tm_wday = 7;

    printf("That was %s\n",wday[tsruct.tm_wday]);
    return 0;
}
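The weekday lookup can also be sketched as a small reusable function that takes the date as numbers instead of reading it from the keyboard. The name weekday_of and its return convention are assumptions made for this example:

```c
#include <time.h>

/* Hypothetical helper: returns 0 (Sunday) to 6 (Saturday) for the given
   date, or -1 if mktime() cannot represent it. month is 1 to 12 and
   year is the full four-digit year. */
int weekday_of(int day, int month, int year)
{
    struct tm t;

    t.tm_sec = 0;
    t.tm_min = 0;
    t.tm_hour = 12;             /* midday avoids daylight-saving edge cases */
    t.tm_mday = day;
    t.tm_mon = month - 1;       /* tm_mon counts from 0 */
    t.tm_year = year - 1900;    /* tm_year counts from 1900 */
    t.tm_wday = 0;              /* ignored on input, set by mktime() */
    t.tm_yday = 0;              /* ignored on input, set by mktime() */
    t.tm_isdst = -1;            /* let the library decide */

    if (mktime(&t) == (time_t)-1)
        return -1;
    return t.tm_wday;
}
```

For instance, weekday_of(6, 12, 2009) should give 0, since December 6, 2009 fell on a Sunday.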

The mktime( ) function also makes the necessary adjustments for values out of range. This capability can be utilized for discovering what the date will be in n number of days, as shown here:

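#include <stdio.h>
#include <time.h>

int main()
{
    struct tm *tsruct;
    time_t today;

    today = time(NULL);
    tsruct = localtime(&today);

    /* Day of month may now exceed the month's length; mktime() corrects it */
    tsruct->tm_mday += 10;
    mktime(tsruct);

    /* tm_mon counts from 0 and tm_year from 1900 */
    printf("In ten days it will be %02d/%02d/%04d\n",
           tsruct->tm_mday,tsruct->tm_mon + 1,tsruct->tm_year + 1900);
    return 0;
}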
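The same normalizing trick can be packaged so it works on any date, not just today. What follows is a minimal sketch; the function name add_days is hypothetical:

```c
#include <time.h>

/* Hypothetical helper: moves the date in *t forward by n days and lets
   mktime() normalize any overflow of the day, month and year fields.
   Returns 0 on success, -1 if the result cannot be represented. */
int add_days(struct tm *t, int n)
{
    t->tm_mday += n;            /* may now be out of range, e.g. 35 */
    t->tm_isdst = -1;           /* daylight-saving status may have changed */
    return (mktime(t) == (time_t)-1) ? -1 : 0;
}
```

Starting from December 25, 2009 and adding ten days, the block should come back holding January 4, 2010, with tm_wday updated to match.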

Header Files

Function prototypes for library functions supplied with the C compiler, and standard macros, are declared in header files. The ANSI standard on the C programming language lists the following header files:

HEADER FILE   DESCRIPTION

assert.h      Defines the assert debugging macro.
ctype.h       Contains character classification and conversion macros.
errno.h       Contains constant mnemonics for error codes.
float.h       Defines implementation-specific macros for dealing with floating-point mathematics.
limits.h      Defines implementation-specific limits on type values.
locale.h      Contains country-specific parameters.
math.h        Lists prototypes for mathematics functions.
setjmp.h      Defines typedef and functions for setjmp/longjmp.
signal.h      Contains constants and declarations for use by signal( ) and raise( ).
stdarg.h      Contains macros for dealing with argument lists.
stddef.h      Contains common data types and macros.
stdio.h       Lists types and macros required for standard I/O.
stdlib.h      Gives prototypes of commonly used functions and miscellany.
string.h      Contains string manipulation function prototypes.
time.h        Contains structures for time-conversion routines.

Debugging

The ANSI standard on C includes a macro for debugging. Called assert( ), it expands to an if( ) statement that tests an expression; if the expression evaluates to FALSE (zero), the program is terminated and a message is written to the standard error stream:

Assertion failed: expression, file filename, line number
Abnormal program termination

For example, the following program accidentally assigns a zero value to a pointer:

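#include <stdio.h>
#include <assert.h>

main()
{
    /* Demonstration of assert */
    int *ptr;
    int x;

    x = 0;

    /* Whoops! error in this line! */
    ptr = (int *)x;

    /* Terminates the program here if ptr is a null pointer */
    assert(ptr != 0);
}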

When run, this program terminates with the following message:

Assertion failed: ptr != 0, file TEST.C, line 16
Abnormal program termination

When a program is running smoothly, the assert( ) functions can be removed from the compiled program simply by adding, before #include <assert.h>, the line:

#define NDEBUG

Essentially, the assert functions are commented out in the preprocessed source before compilation. This means that the assert expressions are not evaluated and thus cannot cause any side effects.
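One consequence worth illustrating: because the expression disappears under NDEBUG, an assert( ) should only ever check something, never do work the program relies on. The function below is a made-up example of a safe use, where the assertion merely guards a precondition:

```c
#include <assert.h>

/* Hypothetical example: assert() documents and checks a precondition.
   Compiling with NDEBUG defined removes the check entirely, so nothing
   the program depends on may be placed inside the parentheses. */
int checked_divide(int a, int b)
{
    assert(b != 0);     /* a zero divisor is a programming error */
    return a / b;
}
```

By contrast, something like assert(count++ < limit) would behave differently in debug and release builds, because the increment would vanish along with the check.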

Float Errors

Floating-point numbers are stored as binary fractions, and many decimal fractions (0.01, for example) have no exact binary representation, just as 1/3 has no exact decimal one. This creates the potential for rounding errors in calculations that use floating-point numbers. The following program illustrates one such example of rounding error problems:

#include <stdio.h>

int main()
{
    float number;

    for(number = 1; number > 0.4; number -= 0.01)
        printf("\n%f",number);
    return 0;
}

Here, at about 0.47 (depending upon the host computer and compiler) the program would start to store an inaccurate value for number.

This problem can be minimized by using longer floating-point types, double or long double, which have more storage space allocated to them. For really accurate work, though, you should use integers and convert to a floating-point number only for display. Also be aware that C treats floating-point constants such as 0.01 as doubles, so when they are used with float variables the compiler has to convert the double value down to a float.
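The integer approach might be sketched like this, counting in whole hundredths and converting only at the moment of printing. The function name count_down is an assumption for the example:

```c
#include <stdio.h>

/* Counts down in whole hundredths and converts to floating point only
   when printing. Returns the number of values printed. */
int count_down(int from_hundredths, int to_hundredths)
{
    int n;
    int count = 0;

    for (n = from_hundredths; n > to_hundredths; n--)
    {
        printf("%.2f\n", n / 100.0);    /* convert for display only */
        count++;
    }
    return count;
}
```

Calling count_down(100, 40) walks from 1.00 down to 0.41 in exactly 60 steps, with no accumulated error, because the loop variable itself is always an exact integer.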

Error Handling

When a system error occurs within a program (for example, when an attempt to open a file fails) it is helpful for the program to display a message reporting the failure. It is equally useful to the program's developer to know why the error occurred, or at least as much about it as possible. To accommodate this exchange of information, the ANSI standard on C describes a function, perror( ), which has the prototype:

void perror(const char *s);

The program's own prefixed error message is passed to perror( ) as the string parameter. This error message is displayed by perror( ), followed by the host's system error (separated by a colon). The following example illustrates a usage of perror( ):

#include <stdio.h>

int main()
{
    FILE *fp;
    char fname[] = "none.xyz";

    fp = fopen(fname,"r");
    if(!fp)
    {
        perror(fname);
        return 1;
    }
    fclose(fp);
    return 0;
}

If the fopen( ) operation fails, a message is displayed, similar to this one:

none.xyz: No such file or directory

Note, perror( ) sends its output to the predefined stream stderr, which is usually the host computer's display unit. It finds its message from the host computer via the global variable errno, which is set by most, but not all, system functions.

Unpleasant errors might justify the use of abort( ), a function that terminates the running program with a message such as "Abnormal program termination," and returns an exit code of 3 to the parent process or operating system.
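When the message needs to go somewhere other than a bare perror( ) call, or be combined with other text, the ANSI function strerror( ) returns the same system message as a string. The helper below is only a sketch, and can_open is a name invented for it:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: tries to open name for reading. On failure it
   reports the reason on stderr in the same style as perror() and
   returns 0; on success it closes the file again and returns 1. */
int can_open(const char *name)
{
    FILE *fp;

    errno = 0;
    fp = fopen(name, "r");
    if (fp == NULL)
    {
        /* ANSI C does not require fopen() to set errno, although most
           libraries do */
        fprintf(stderr, "%s: %s\n", name, strerror(errno));
        return 0;
    }
    fclose(fp);
    return 1;
}
```

Clearing errno before the call matters because library functions never reset it to zero themselves, so a stale value from an earlier failure could otherwise be reported.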

Critical Error Handling with the IBM PC and DOS

The PC DOS operating system provides a user-amendable critical error-handling function. This function is usually discovered by attempting to write to a disk drive that does not have a disk in it, in which case the familiar:

Not ready; error writing drive A
Abort Retry Ignore?

message is displayed on the screen. The following example program shows how to redirect the DOS critical error interrupts to your own function:

#include <stdio.h>
#include <dos.h>

void interrupt new_int();
void interrupt (*old_int)();

char status;

main()
{
    FILE *fp;

    /* Set critical error handler to my function */
    setvect(0x24,new_int);

    /* Generate an error by not having a disk in drive A */
    fp = fopen("a:\\data.txt","w+");

    /* Display error status returned */
    printf("\nStatus == %d",status);
}

void interrupt new_int()
{
    /* set global error code */
    status = _DI;

    /* ignore error and return */
    _AL = 0;
}

When the DOS critical error interrupt is called, a status message is passed in the low byte of the DI register. This message is one of the following:

CODE   MEANING

00     Write-protect error.
01     Unknown unit.
02     Drive not ready.
03     Unknown command.
04     Data error, bad CRC.
05     Bad request structure length.
06     Seek error.
07     Unknown media type.
08     Sector not found.
09     Printer out of paper.
0A     Write error.
0B     Read error.
0C     General failure.

Your critical error interrupt handler can transfer this status message into a global variable, then set the result held in register AL to one of these:

CODE   ACTION

00     Ignore error.
01     Retry.
02     Terminate program.
03     Fail (available with DOS 3.3 and above).

If you choose to set AL to 02, terminate program, be sure that all files are closed first because DOS will terminate the program abruptly, leaving files open and memory allocated.

The following is a practical function for checking whether a specified disk drive can be accessed. It should be used with the earlier critical error handler and global variable status:

int DISKOK(int drive)
{
    /* Checks for whether a disk can be read */
    /* Returns false (zero) on error */
    /* Thus if(!DISKOK(drive)) */
    /*     error(); */

    char buffer[25];
    int handle;

    /* Assume okay */
    status = 0;

    /* If already logged to disk, return okay */
    /* (diry is assumed to be a global string holding the current directory) */
    if ('A' + drive == diry[0])
        return(1);

    /* Attempt to read disk */
    memset(buffer,0,20);
    sprintf(buffer,"%c:$$$.$$$",'A'+drive);
    handle = _open(buffer,O_RDONLY);
    if (handle != -1)
        _close(handle);

    /* Check critical error handler status */
    if (status == 0)
        return(1);

    /* Disk cannot be read */
    return(0);
}

Casting

Casting tells the compiler what a data type is, and it can be used to change a data type. For example, consider the following snippet:

#include <stdio.h>

void main()
{
    int x;
    int y;

    x = 10;
    y = 3;

    printf("\n%lf",x / y);
}

The printf( ) function here has been told to expect a double; however, the compiler sees the variables x and y as integers, so an int is passed instead and the output is garbage. To make this example work, you must tell the compiler that the result of the expression x/y is a double, with a cast:

#include <stdio.h>

int main()
{
    int x;
    int y;

    x = 10;
    y = 3;

    printf("\n%lf",(double)(x / y));
    return 0;
}

Notice that the data type double is enclosed by parentheses, and so is the expression to convert. Now the compiler knows that the result of the expression is to be a double, but it still sees the variables x and y as integers. With this, an integer division will be carried out first and only its truncated result converted; therefore, it is necessary to cast the operands themselves:

#include <stdio.h>

int main()
{
    int x;
    int y;

    x = 10;
    y = 3;

    printf("\n%lf",(double)(x) / (double)(y));
    return 0;
}

Finally, because both of the operands are doubles, the compiler carries out a floating-point division, and the outcome of the expression is also a double.
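The difference between the two casts can be seen side by side in this small sketch. Both function names are made up for the illustration:

```c
/* Integer division happens first; only the truncated result is converted */
double truncated_quotient(int x, int y)
{
    return (double)(x / y);
}

/* Converting one operand first forces a floating-point division; the
   other operand is then promoted to double automatically */
double true_quotient(int x, int y)
{
    return (double)x / y;
}
```

With x = 10 and y = 3, the first returns exactly 3.0 while the second returns a value close to 3.333333. Casting just one operand is enough, because C converts the other to match before dividing.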

Prototyping

Prototyping a function involves letting the compiler know, in advance, what type of values a function will receive and return. For example, look at strtok( ) with this prototype:

char *strtok(char *s1, const char *s2);

This tells the compiler that strtok( ) will return a character pointer. The first parameter received will be a pointer to a character string, and that string can be changed by strtok( ). The last parameter will be a pointer to a character string that strtok( ) cannot change. The compiler knows how much space to allocate for the return parameter, sizeof(char *), but without a prototype for the function the compiler will assume that the return value of strtok( ) is an integer, and will allocate space for a return type of int (sizeof(int)). If an integer and a character pointer occupy the same number of bytes on the host computer, no major problems will occur, but if a character pointer occupies more space than an integer, the compiler will not have allocated enough space for the return value, and the return from a call to strtok( ) will overwrite some other bit of memory.

Fortunately, most C compilers will warn the programmer if a call to a function has been made without a prototype, so that you can add the required function prototypes. Consider the following example that will not compile on most modern C compilers due to an error:

#include <stdio.h>

int FUNCA(int x, int y)
{
    return(MULT(x,y));
}

double MULT(double x, double y)
{
    return(x * y);
}

main()
{
    printf("\n%d",FUNCA(5,5));
}

The compiler first encounters the function MULT( ) as a call from within FUNCA( ). In the absence of any prototype for MULT( ), the compiler assumes that MULT( ) returns an integer. When the compiler then finds the definition of MULT( ), it sees that a return of type double has been declared. The compiler then reports an error in the compilation, such as:

"Type mismatch in redeclaration of function 'MULT'"

The compiler is essentially telling you to prototype your functions before using them! If this example did compile and execute, it would probably crash the computer's stack.
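A corrected arrangement, sketched with the same function names, only needs the prototype to appear before the first call:

```c
/* Prototype: tells the compiler about MULT() before it is first called */
double MULT(double x, double y);

int FUNCA(int x, int y)
{
    /* The int arguments are converted to double because of the
       prototype; the double result is converted back to int here */
    return (int)MULT(x, y);
}

double MULT(double x, double y)
{
    return x * y;
}
```

With the prototype in place, FUNCA(5, 5) yields 25. Moving the definition of MULT( ) above FUNCA( ) would work just as well, since a definition also serves as a declaration.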

Pointers to Functions

C allows a pointer to hold the address of a function, so that the function can be called through the pointer rather than by name. This is used by interrupt-changing functions and may be used for indexing functions rather than using switch statements. For example:

#include <stdio.h>
#include <math.h>

void main()
{
    /* Array of seven pointers to functions taking and returning a double */
    double (*fp[7])(double x);
    double x;
    int p;

    fp[0] = sin;
    fp[1] = cos;
    fp[2] = acos;
    fp[3] = asin;
    fp[4] = tan;
    fp[5] = atan;
    fp[6] = ceil;

    p = 4;

    /* Call the function selected by p, here tan() */
    x = fp[p](1.5);
    printf("\nResult %lf",x);
}

This example program defines an array of pointers to functions, (*fp[ ])( ), that are called dependent upon the value in the indexing variable p. This program could also be written as:

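#include <stdio.h>
#include <math.h>

void main()
{
    double x;
    int p;

    p = 4;

    switch(p)
    {
        case 0 : x = sin(1.5); break;
        case 1 : x = cos(1.5); break;
        case 2 : x = acos(1.5); break;
        case 3 : x = asin(1.5); break;
        case 4 : x = tan(1.5); break;
        case 5 : x = atan(1.5); break;
        case 6 : x = ceil(1.5); break;
    }
    printf("\nResult %lf",x);
}

The first example, using pointers to the functions, typically compiles into more compact code and can execute faster than the second example, although the difference depends on the compiler. The table of pointers to functions is a useful facility when writing language interpreters. The program compares an entered instruction against a table of keywords that results in an index variable being set. The program simply needs to call the function pointer, indexed by the variable, rather than wading through a lengthy switch( ) statement.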
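The interpreter idea could be sketched along these lines: a table pairs each keyword with a handler, and a lookup calls through the stored pointer. Everything here (the keywords, op_inc, op_dbl, op_neg, and run_op) is invented for the illustration:

```c
#include <string.h>

/* Hypothetical instruction handlers for a toy interpreter */
static double op_inc(double x) { return x + 1.0; }
static double op_dbl(double x) { return x * 2.0; }
static double op_neg(double x) { return -x; }

/* Keyword table: each name is paired with the function that implements it */
static const struct
{
    const char *name;
    double (*fn)(double);
} ops[] =
{
    { "inc", op_inc },
    { "dbl", op_dbl },
    { "neg", op_neg }
};

/* Looks up keyword and calls its handler through the function pointer.
   Sets *found to 1 on success, or to 0 if the keyword is unknown. */
double run_op(const char *keyword, double x, int *found)
{
    size_t i;

    for (i = 0; i < sizeof(ops) / sizeof(ops[0]); i++)
    {
        if (strcmp(keyword, ops[i].name) == 0)
        {
            *found = 1;
            return ops[i].fn(x);
        }
    }
    *found = 0;
    return 0.0;
}
```

Adding a new instruction then means adding one handler and one table row, with no change to the lookup code.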

Sizeof

An operator evaluated at compile time, sizeof, returns the size in bytes of an item, be it a structure, pointer, array, or whatever. However, care is required when using sizeof: consider the following program:

#include <stdio.h>
#include <string.h>

char string1[80];
char *text = "This is a string of data";

void main()
{
    /* Initialize string1 correctly */
    memset(string1,0,sizeof(string1));

    /* Copy some text into string1 ? */
    memcpy(string1,text,sizeof(text));

    /* Display string1 */
    printf("\nString 1 = %s\n",string1);
}

This example says to initialize all 80 elements of string1 to zeroes, then copy the constant string text into the variable string1. However, variable text is a pointer, so the sizeof(text) instruction returns the size of the character pointer (perhaps 2 bytes) rather than the length of the string pointed to by the pointer. If the length of the string pointed to by text happened to be the same as the size of a character pointer, an error would not be noticed.
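One way to get the intended result is to measure the data with strlen( ) and keep sizeof for the destination array only. This is a minimal sketch, and copy_text is a hypothetical name:

```c
#include <string.h>

/* Hypothetical helper: copies src into dst (dstsize bytes in total),
   truncating if necessary and always terminating the result.
   Returns the number of characters copied, excluding the terminator. */
size_t copy_text(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src);   /* length of the data, not of the pointer */

    if (dstsize == 0)
        return 0;
    if (len >= dstsize)
        len = dstsize - 1;      /* leave room for the terminator */
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

A call such as copy_text(string1, sizeof(string1), text) is safe because string1 is a true array, so sizeof gives its full 80 bytes at the point of the call.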

Interrupts

The PC BIOS and DOS contain functions that may be called by a program by way of the function's interrupt number. The address of the function assigned to each interrupt is recorded in a table in RAM, called the interrupt vector table. By changing the address of an interrupt vector, a program can effectively disable the original interrupt function and divert any calls to it to its own function.

Borland's Turbo C provides two library functions for reading and changing an interrupt vector: setvect( ) and getvect( ). The corresponding Microsoft C library functions are: _dos_getvect( ) and _dos_setvect( ).

The getvect( ) function has this prototype:

void interrupt(*getvect(int interrupt_no))();

And setvect( ) has this prototype:

void setvect(int interrupt_no, void interrupt(*func)());

To read and save the address of an existing interrupt, a program uses getvect( ) in this way:

void interrupt(*old)(void);

main()
{
    /* get old interrupt vector */
    old = getvect(0x1C);
}

Here, 0x1C is the interrupt vector to be retrieved. To set the interrupt vector to a new address, your own function, use setvect( ):

void interrupt new(void)
{
    /* New interrupt function */
}

main()
{
    setvect(0x1C,new);
}

There are two important points to note when it comes to interrupts. First, if the interrupt is called by external events, before changing the vector you must disable the interrupt callers, using disable( ). Then you reenable the interrupts after the vector has been changed, using enable( ). If a call is made to the interrupt while the vector is being changed, anything could happen.

Second, before your program terminates and returns to DOS, you must reset any changed interrupt vectors. The exception to this is the critical error handler interrupt vector, which is restored automatically by DOS, hence your program needn't bother restoring it.

This example program hooks the PC clock timer interrupt to provide a background clock process while the rest of the program continues to run:

#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <dos.h>
#include <time.h>

enum { FALSE, TRUE };

#define COLOR ((BLUE << 4) | YELLOW)
#define BIOS_TIMER 0x1C

static unsigned installed = FALSE;
static void interrupt (*old_tick) (void);

static void interrupt tick (void)
{
    int i;
    struct tm *now;
    time_t this_time;
    char time_buf[9];
    static time_t last_time = 0L;

    /* Character/attribute pairs for the ten screen cells " 00:00:00 " */
    static char video_buf[20] =
    {
        ' ', COLOR, '0', COLOR, '0', COLOR, ':', COLOR, '0', COLOR,
        '0', COLOR, ':', COLOR, '0', COLOR, '0', COLOR, ' ', COLOR
    };

    enable ();

    /* Redraw only when the second has changed */
    if (time (&this_time) != last_time)
    {
        last_time = this_time;
        now = localtime(&this_time);
        sprintf(time_buf, "%02d:%02d.%02d",now->tm_hour,now->tm_min,now->tm_sec);

        /* Copy the digits into the character bytes, skipping the attributes */
        for (i = 0; i < 8; i++)
        {
            video_buf[(i + 1) << 1] = time_buf[i];
        }
        puttext (71, 1, 80, 1, video_buf);
    }

    /* Chain to the original timer handler */
    old_tick ();
}

void stop_clock (void)
{
    if (installed)
    {
        setvect (BIOS_TIMER, old_tick);
        installed = FALSE;
    }
}

void start_clock (void)
{
    static unsigned first_time = TRUE;

    if (!installed)
    {
        /* Make sure the vector is restored when the program exits */
        if (first_time)
        {
            atexit (stop_clock);
            first_time = FALSE;
        }
        old_tick = getvect (BIOS_TIMER);
        setvect (BIOS_TIMER, tick);
        installed = TRUE;
    }
}

Signal

Interrupts raised by the host computer can be trapped and diverted in several ways. A simple method is to use signal( ). Signal() takes two parameters in the form:

void (*signal (int sig, void (*func) (int))) (int);

The first parameter, sig, is the signal to be caught. This is often predefined by the header file signal.h. The second parameter is a pointer to a function to be called when the signal is raised. This can be either a user function or a macro defined in the header file signal.h, to do some arbitrary task, such as ignore the signal.

On a PC platform, it is often useful to disable the Ctrl-Break key combination that the user can press to terminate a running program. The following PC signal( ) call installs the predefined macro SIG_IGN as the handler for the predefined signal SIGINT, which equates to the Ctrl-Break interrupt request, so that the request is ignored:

signal(SIGINT,SIG_IGN);

This example catches floating-point errors on a PC, and zero divisions:

#include <stdio.h>
#include <signal.h>

void (*old_sig)(int);

void catch(int sig)
{
    printf("Catch was called with: %d\n",sig);
}

int main()
{
    int a;
    int b;

    old_sig = signal(SIGFPE,catch);

    /* What happens after catch() returns is compiler-dependent */
    a = 0;
    b = 10 / a;

    /* Restore original handler before exiting! */
    signal(SIGFPE,old_sig);
    return 0;
}
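For experimenting without a real hardware fault, ANSI C's raise( ) can deliver a signal from inside the program. The following is a sketch only; note_signal and catch_sigint_once are names made up for it:

```c
#include <signal.h>

/* Set by the handler; sig_atomic_t can be written safely from a handler */
static volatile sig_atomic_t last_signal = 0;

static void note_signal(int sig)
{
    last_signal = sig;
}

/* Hypothetical demonstration: installs a handler for SIGINT, raises the
   signal from within the program, then restores the previous handler.
   Returns 1 if the handler ran, 0 if not, -1 if it could not be installed. */
int catch_sigint_once(void)
{
    void (*old_handler)(int);

    last_signal = 0;
    old_handler = signal(SIGINT, note_signal);
    if (old_handler == SIG_ERR)
        return -1;

    raise(SIGINT);                  /* handler runs before raise() returns */
    signal(SIGINT, old_handler);    /* put the original handler back */

    return last_signal == SIGINT;
}
```

The handler here does nothing except record the signal number in a flag, which is about all that a handler can do portably; the main program then inspects the flag after the fact.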


Sunday, December 6, 2009
