Introduction
As we discontinue research use of the PSUVM mainframe, a few people have shared some of their conversion problems with us. One of these is moving and handling large data files.
To get an idea of what porting sizable files to a microcomputer implies, we ran a quick experiment "handling" a large file of integer data. The handling and some timings are reported here. The results show that, at least for a few simple kinds of data manipulation, relatively large files can be handled on microcomputers rather efficiently. They also give some idea of the minimal "power" a microcomputer needs in order to process this kind of data. Part of that "power" is having reasonably defragmented disks; on a badly fragmented disk, I/O time increases rather dramatically, particularly for large files. We also show where to find out more about, or download, the various tools used here.
By the way, we do know that the microcomputer versions of SAS®, MINITAB®, and SPSS® will handle files as large as, and larger than, the one used here. We did not time these applications, but performance was quite good on the platform described below.
Platform used for timings
Dell Optiplex GX1p PC, 500MHz, 384MB RAM, 1GB swap file, 2.8GB IDE fixed disks.
Sample file used for these timings
The file is an ASCII file of 145,648,282 bytes (about 139MB). Bill Verity uploaded this file, women94a.data, from the mainframe for a research project. The file's content is supposed to consist only of integers, blanks, and minus (-) signs. It has 5083 lines, each of which is 28653 bytes wide.
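The reported size is consistent with the line count and width if each line ends in a single one-byte terminator. That terminator is our assumption, not something stated about the upload. A quick arithmetic check in Python:

```python
# Consistency check of the reported file size.
# Assumption: every line ends with one terminator byte.
LINES = 5083
LINE_WIDTH = 28653      # data bytes per line, as reported
TERMINATOR = 1          # assumed: one newline byte per line

total_bytes = LINES * (LINE_WIDTH + TERMINATOR)
size_mb = total_bytes / (1024 * 1024)

print(total_bytes)          # 145648282
print(round(size_mb, 1))    # 138.9
```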
Text editor used
KEDIT® is an excellent Windows 9x/NT text editor; see http://www.kedit.com/ . VEDIT® is another editor choice for large files; see http://www.vedit.com/ .
Operations Timed: Input, edit, scan for validity, output, subset.
Input
Command: Kedit women94a.data (width 29000
Time: 32 seconds; second and subsequent Kedit invocations: 5 seconds (file cached)
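The gap between a first read and a cached second read can be observed outside the editor as well. The following is a minimal sketch, not the method used for the timings reported here; the function name `time_reads` and the 1MB chunk size are our own illustrative choices.

```python
import time

def time_reads(path, repeats=2, chunk_size=1 << 20):
    """Read the whole file `repeats` times; return (bytes_read, seconds) per pass."""
    results = []
    for _ in range(repeats):
        start = time.perf_counter()
        total = 0
        with open(path, "rb") as f:
            # Read in fixed-size chunks so memory use stays small.
            while chunk := f.read(chunk_size):
                total += len(chunk)
        results.append((total, time.perf_counter() - start))
    return results
```

Calling `time_reads("women94a.data")` would return two timings; on a system with enough free memory the second pass is usually served from the file cache and is faster.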
Edit
Kedit subcommands: add, delete, copy or modify, move lines; search for string.
Time: virtually instant.
Scan
Kedit subcommand to scan the file for valid character content, that is, to show all lines containing a character other than a digit, blank, or minus sign: all reg /[~0-9 \-]
Time: 45 seconds
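The same validity scan can be sketched in Python. This assumes the `~` inside the brackets negates the character class, as the description of the scan implies; the function name `find_invalid_lines` is ours.

```python
import re

# Any byte other than a digit, a blank, or a minus sign is invalid.
INVALID = re.compile(rb"[^0-9 \-]")

def find_invalid_lines(path):
    """Return the 1-based numbers of lines containing an invalid character."""
    bad = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            # Strip the line terminator so it is not reported as invalid.
            if INVALID.search(line.rstrip(b"\r\n")):
                bad.append(lineno)
    return bad
```

An empty list from `find_invalid_lines("women94a.data")` would mean the file contains only the expected characters.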
Output
Kedit subcommand: FILE/SAVE w.dat
Time: 32 seconds
Subset
Command: Kedit (width 29000, then issue the Kedit subcommand:
get women94a.data 101 100
to get records 101, 102, ... 200.
Time: 4 seconds
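For comparison, pulling a block of records by position takes only a few lines of Python. This is an illustrative sketch; `get_records` is a hypothetical helper whose arguments mirror the first-record and count values used above.

```python
from itertools import islice

def get_records(path, first, count):
    """Return `count` lines starting at 1-based record number `first`."""
    with open(path, "rb") as f:
        # islice skips the first (first - 1) lines without storing them.
        return list(islice(f, first - 1, first - 1 + count))
```

`get_records("women94a.data", 101, 100)` would return records 101 through 200.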
Use SAS for more sophisticated criteria for subsetting a large file. For example, see: http://www.swmed.edu/home_pages/infotimes/articles/v.no6/v6sastip.htm .
System COPY command:
COPY women94a.data w.dat
Time: 33 seconds
The following sorts yield identical results:
System command:
SORT /+1 < women94a.data > wsort.dat
Time: 181 seconds
Command: Kedit women94a.data, SORT * A 1 1, FILE wsort.dat
Time: 65 seconds
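An ascending sort keyed on a column range can be sketched as follows. This is an illustration only: `sort_lines` is a hypothetical helper, and Python's stable ordering of equal keys is not claimed to match the tie-breaking of either tool above.

```python
def sort_lines(lines, first_col=1, last_col=1):
    """Sort lines ascending on the 1-based column range [first_col, last_col]."""
    def key(line):
        return line[first_col - 1:last_col]
    # sorted() is stable: lines with equal keys keep their input order.
    return sorted(lines, key=key)
```

For example, `sort_lines(["b2", "a1", "b1"])` gives `["a1", "b2", "b1"]`, because only column 1 takes part in the comparison.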
Comparing two large files via the Windows native (system) FC command versus COMPARE.EXE, a 32-bit console command implemented in Fortran.
The two files compare as "same" or "identical" in this case. The second COMPARE run is shorter because, once a copy of the file is in pageable memory, reading it again causes no page faults.
System compare: FC w.dat women94a.data
Time: First FC: 513 seconds; second FC: 513 seconds
COMPARE w.dat women94a.data
Time: First Compare: 34 seconds; second Compare: 5 seconds.
Compare utility may be found at:
http://ftp.cac.psu.edu/pub/ger/fortran/hdk/compare.exe
Documentation is the file: http://ftp.cac.psu.edu/pub/ger/fortran/hdk/compare.txt
COMPARE.EXE compares 256000 characters per compare operation; the native Windows FC compares 1 character per compare. This is one reason COMPARE was written and is made available to the public.
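The idea of comparing large blocks rather than single characters can be sketched in Python. This is not the Fortran source of COMPARE.EXE; it is a minimal illustration, and the function name `files_identical` is ours.

```python
def files_identical(path_a, path_b, chunk_size=256000):
    """Return True if both files hold identical bytes, comparing in chunks."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False      # a difference, or one file ended early
            if not block_a:
                return True       # both files ended at the same point
```

Each comparison handles up to 256000 bytes at once, so the per-operation overhead is paid a few hundred times for a file of this size instead of once per character.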
Note: To binary compare many pairs of files for "same" or "different" in two subdirectories, and optionally in their child subdirectories, use the program CSDIFF.
Get the "Standalone" version from: http://www.ComponentSoftware.com/csdiff/. CSDIFF can also do an "intelligent" compare of two TEXT, HTML, or MS WORD files; it displays file differences in one of two easy-to-understand formats. CSDIFF is free for personal use.
INFOZIP ZIP and UNZIP are free Zip compress/uncompress Win32 programs that work under all Windows platforms. They are also available for other platforms. Here we compress and uncompress a large sample data file. InfoZip zip stores cyclic redundancy check (crc) bytes in the zip file, and unzip checks the extracted data against them.
INFOZIP ZIP/UNZIP for Windows 9x/NT/2000 are available on the Web: ftp://ftp.cdrom.com/pub/infozip/WIN32/zip22xN.zip and ftp://ftp.cdrom.com/pub/infozip/WIN32/unz540xN.exe respectively.
zip -j women94a.zip women94a.data
Time: 53 seconds; compressed the file to 11,980,107 bytes, including 92 bytes of crc, a reduction of about 92%.
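The reduction can be checked from the two byte counts reported in this document; a short calculation in Python:

```python
ORIGINAL = 145_648_282   # bytes, women94a.data
ZIPPED = 11_980_107      # bytes, zip output reported above

# Fraction of the original size that compression removed.
saved = 1 - ZIPPED / ORIGINAL
print(f"{saved:.1%}")    # 91.8%
```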
unzip women94a.zip
Time: 37 seconds.
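The compress-then-verify round trip can also be sketched with Python's standard `zipfile` module, which stores a CRC-32 for each archive member and can recheck it on reading. This is a stand-in for illustration, not the InfoZip programs themselves; the helper name `zip_and_verify` is ours.

```python
import zipfile

def zip_and_verify(src_path, zip_path, arcname):
    """Compress one file, then re-read the archive and check its CRC-32."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src_path, arcname=arcname)
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() returns None when every member's CRC matches.
        bad_member = zf.testzip()
        info = zf.getinfo(arcname)
        return bad_member is None, info.file_size, info.compress_size
```

The returned tuple gives the check result together with the original and compressed sizes, from which a reduction percentage can be computed.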
PKWARE® PKZIP and PKUNZIP are commercial versions of Zip compression tools. Here we use the DOS 32-bit versions. Other versions, including versions that run via Windows Explorer, are available at: http://www.pkware.com.
PKZIP -a -! women94a.zip WOMEN9~1.DAT
Time: 38 seconds; compressed the file to 11,590,145 bytes, a reduction of about 92%. This version of PKZIP/PKUNZIP recognizes only DOS 8.3 file ids. The -! option creates "authentication" check bytes, similar to the crc of ZIP/UNZIP above.
PKUNZIP women94a.zip
Time: 39 seconds.
For further information, please see:
http://ftp.cac.psu.edu/pub/ger/documents/handstat.html