Handling Large Data Files on a PC,
Some Techniques and Timings

By H. D. Knoble and Bill Verity, Center for Academic Computing

Introduction

As we discontinue research use of the PSUVM mainframe, a few people have shared some of their conversion problems with us. One of these is moving and handling large data files.

To get an idea of what porting sizable files to a microcomputer implies, we ran a short experiment in "handling" a large file of integer data, and we report that handling and some timings here. The results show that, at least for a few simple kinds of data manipulation, relatively large files can be handled on microcomputers rather efficiently. They also give some idea of the minimal "power" a microcomputer needs in order to process this kind of data. Part of that "power" is having reasonably defragmented disks; on a badly fragmented disk, I/O time increases dramatically, particularly for large files. We also show where to find out more about, or download, the tools used here.

By the way, we do know that the microcomputer versions of SAS®, MINITAB®, and SPSS® will handle files as large as, and larger than, the one used here. We did not time these applications, but their performance was quite good on the platform described below.

Platform used for timings

PC: Dell Optiplex GX1p, 500MHz, 384MB RAM, 1GB swap file, 2.8GB IDE fixed disks.

Sample file used for these timings

The file is an ASCII file of 145,648,282 bytes (139MB). Bill Verity uploaded this file, women94a.data, from the mainframe for a research project. Its content is supposed to consist only of integers, blanks, and minus (-) signs. It has 5083 lines, each 28653 bytes wide.
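
The figures above are consistent with each other if every line ends in a one-byte terminator: 5083 x (28653 + 1) = 145,648,282. A minimal Python sketch for checking the line count and line widths of such a file follows; it was not part of the timings, and the function name line_profile is our own, hypothetical choice.

```python
from collections import Counter

def line_profile(path):
    """Return (line_count, Counter mapping line width -> number of lines)."""
    widths = Counter()
    count = 0
    with open(path, "rb") as f:              # binary mode: widths are in bytes
        for raw in f:
            count += 1
            # Width excludes the line terminator (LF or CR LF).
            widths[len(raw.rstrip(b"\r\n"))] += 1
    return count, widths

# Arithmetic check on the quoted figures, assuming one terminator byte per line.
EXPECTED_BYTES = 5083 * (28653 + 1)
```

For a well-formed copy of the sample file, line_profile should report 5083 lines and a single width of 28653.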

Text editor used

KEDIT® is an excellent Windows 9x/NT text editor; see http://www.kedit.com/ . VEDIT® is another editor choice for large files; see http://www.vedit.com/ .

Operations Timed: Input, edit, scan for validity, output, subset.

Input
Command: Kedit women94a.data (width 29000
Time: 32 seconds; second and subsequent loads with Kedit: 5 seconds (the file is then cached)

Edit
Kedit Subcommands: add, delete, copy or modify, move lines; search for string.
Time: virtually instant.

Scan
Kedit subcommand to scan the file for valid character content, that is, to show all lines containing a character other than a digit, blank, or minus sign: all reg /[~0-9 \-]
Time: 45 seconds
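
The same validity scan can be sketched outside the editor. The Python below assumes, as we read the Kedit expression, that ~ negates the character class, so any byte other than a digit, a blank, or a minus sign is reported. The name find_invalid_lines is hypothetical, and we did not time this version.

```python
import re

# Any byte that is not a digit, a blank, or a minus sign counts as invalid.
INVALID = re.compile(rb"[^0-9 \-]")

def find_invalid_lines(path):
    """Yield (line_number, offset_in_line, offending_byte) for each bad line."""
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            # Strip the terminator first so CR/LF is not reported as invalid.
            m = INVALID.search(raw.rstrip(b"\r\n"))
            if m:
                yield lineno, m.start(), m.group()
```

Only the first offending byte on each line is reported; an empty result means the file passed the scan.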

Output
Kedit subcommand: FILE w.dat or SAVE w.dat
Time: 32 seconds

Subset
Start Kedit with the option (width 29000, then issue the Kedit subcommand:
get women94a.data 101 100
to get records 101, 102, ... 200.
Time: 4 seconds
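
For comparison, a minimal Python sketch of the same subset operation: copy a given number of records starting at a given 1-based record number, without loading the whole file. The function name copy_records and the output file name in the usage note are hypothetical.

```python
from itertools import islice

def copy_records(src, dst, first, count):
    """Copy `count` lines starting at 1-based line `first`; return lines copied."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # islice skips the first (first - 1) lines and stops after `count` more.
        for raw in islice(fin, first - 1, first - 1 + count):
            fout.write(raw)
            copied += 1
    return copied
```

Calling copy_records("women94a.data", "subset.dat", 101, 100) would write records 101 through 200 to subset.dat.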

Use SAS for more sophisticated subsetting criteria on a large file. For example, see: http://www.swmed.edu/home_pages/infotimes/articles/v.no6/v6sastip.htm

System copying and sorting

System COPY command:
COPY women94a.data w.dat
Time: 33 seconds

The following sorts yield identical results:
System command:
SORT /+1 < women94a.data > wsort.dat
Time: 181 seconds
Command: Kedit women94a.data, SORT * A 1 1, FILE wsort.dat
Time: 65 seconds
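
A file of this size also fits in memory on the platform above, so a whole-line sort can be sketched in a few lines of Python. This is an illustration only: it orders lines by raw byte value, which for digits, blanks, and minus signs is plain ASCII order, and we have not verified that its output matches that of SORT or Kedit. The name sort_lines is hypothetical.

```python
def sort_lines(src, dst):
    """Sort the lines of `src` in ascending byte order into `dst`; return line count."""
    with open(src, "rb") as f:
        lines = f.readlines()                # the whole file is held in memory
    # Give an unterminated last line a terminator so it cannot fuse with its neighbor.
    if lines and not lines[-1].endswith(b"\n"):
        lines[-1] += b"\n"
    lines.sort()                             # compares whole lines, byte by byte
    with open(dst, "wb") as f:
        f.writelines(lines)
    return len(lines)
```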

Compare large files for content

Here we compare two large files using the native Windows (system) FC command and using COMPARE.EXE, a 32-bit console command implemented in Fortran.

The two files compare as "same" or "identical" in this case. The second COMPARE run is shorter because, once a copy of the file is in pageable memory, reading it again causes no page faults.

System compare: FC w.dat women94a.data
Time: First FC: 513 seconds; second FC: 513 seconds

COMPARE w.dat women94a.data
Time: First COMPARE: 34 seconds; second COMPARE: 5 seconds.

The COMPARE utility may be found at:
http://ftp.cac.psu.edu/pub/ger/fortran/hdk/compare.exe

Documentation is the file: http://ftp.cac.psu.edu/pub/ger/fortran/hdk/compare.txt

COMPARE.EXE compares 256000 characters per compare, whereas the native Windows FC compares 1 character per compare; this is one reason COMPARE was written and made available to the public.
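
The idea of comparing large blocks rather than single characters can be sketched in Python. This is not the Fortran source of COMPARE.EXE, only an illustration of the block-compare approach; the function name files_identical is hypothetical, and the default chunk size is taken from the figure quoted above.

```python
def files_identical(path_a, path_b, chunk_size=256000):
    """Return True if the two files hold exactly the same bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:           # differing content, or one file ended early
                return False
            if not a:            # both reads came back empty: end of both files
                return True
```

Python's standard library offers a similar check as filecmp.cmp(path_a, path_b, shallow=False).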

Note: To binary-compare many pairs of files for "same" or "different" in two subdirectories, and optionally in their child subdirectories, use the program CSDIFF.

Get the "Standalone" version from: http://www.ComponentSoftware.com/csdiff/ . CSDIFF can also do an "intelligent" compare of two text, HTML, or MS Word files, displaying the differences in one of two easy-to-understand formats. CSDIFF is free for personal use.

Compressing/Uncompressing programs used

INFOZIP ZIP and UNZIP are free Win32 programs for Zip compression and uncompression that work under all Windows platforms; they are also available for other platforms. Here we compress and uncompress the large sample data file. ZIP stores cyclic redundancy check (CRC) bytes in the zip file, and UNZIP checks the extracted data against them.

INFOZIP ZIP/UNZIP for Windows 9x/NT/2000 are available on the Web: ftp://ftp.cdrom.com/pub/infozip/WIN32/zip22xN.zip and ftp://ftp.cdrom.com/pub/infozip/WIN32/unz540xN.exe respectively.

zip -j women94a.zip women94a.data

Time: 53 seconds; compressed the file to 11,980,107 bytes, including 92 bytes of CRC, a reduction of about 92%.

unzip women94a.zip
Time: 37 seconds.
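
The size reduction quoted above follows from the byte counts: 1 - 11,980,107 / 145,648,282 is about 0.918. The Python sketch below uses the standard zipfile module, not the INFOZIP programs themselves, to compress one file, verify its stored CRC, and report the same kind of ratio. The function name zip_and_verify is hypothetical, and the sketch assumes a non-empty input file.

```python
import os
import zipfile

def zip_and_verify(src, archive):
    """Deflate `src` into `archive`, check its CRC, and return the size reduction (0..1)."""
    name = os.path.basename(src)             # store the bare file name, as zip -j does
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=name)
    with zipfile.ZipFile(archive) as zf:
        # testzip() reads every member and returns the first one whose CRC fails.
        if zf.testzip() is not None:
            raise ValueError("CRC check failed for " + name)
        info = zf.getinfo(name)
        return 1 - info.compress_size / info.file_size
```

The ratio returned here counts only the compressed member data, so it can differ slightly from a ratio computed from the size of the whole zip file.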

PKWARE® PKZIP and PKUNZIP are commercial Zip compression tools. Here we use the 32-bit DOS versions. Other versions, including versions that run from Windows Explorer, are available at:
http://www.pkware.com.

PKZIP -a -! women94a.zip WOMEN9~1.DAT
Time: 38 seconds; compressed the file to 11,590,145 bytes, a reduction of about 92%. This version of PKZIP/PKUNZIP recognizes only DOS 8.3 file names. The -! option creates "authentication" check bytes, similar to the CRC of ZIP/UNZIP above.

PKUNZIP women94a.zip
Time: 39 seconds.

For further information, please see:
http://ftp.cac.psu.edu/pub/ger/documents/handstat.html

