Understanding the Excel. xls Binary File Format
Summary: Learn about the MS-XLS binary file format that is used in previously released Microsoft Excel products. Included in this article are the basic structures and key concepts for interacting with this file format programmatically.
The recommended way to perform most programming tasks in Microsoft Excel is to use the Excel Primary Interop Assemblies. These are a set of. NET classes that provide a complete object model for working with Microsoft Excel. This article series deals only with advanced scenarios, such as where Microsoft Excel is not installed.
Key Components of the MS-XLS File Format
The record descriptions in the [MS-XLS]: Excel Binary File Format (.xls) Structure Specification do not include mention of the Record Type (rt) and Count of Bytes (cb) values that make up the first four bytes of the record. For more information, see section 2.1.4 of the MS-XLS specification.
These are the main streams, substreams, and records in an MS-XLS format file. Specific byte locations within a record are counted from the end of the cb field.
The Workbook stream is the primary stream in an. xls file. It contains multiple substreams, each of which starts with a Beginning of File (BOF) record and ends with an End of File (EOF) record. The first stream is always the Globals substream, and the rest are sheet substreams. These include worksheets, macro sheets, chart sheets, dialog sheets, and VBA module sheets.
The Globals substream specifies global properties and data in a workbook. It also includes a BoundSheet8 record for each substream in the Workbook stream.
A BoundSheet8 record gives information about a sheet substream. This includes name, location, type, and visibility. The first 4 bytes of the record, the lbPlyPos FilePointer, specifies the position in the Workbook stream where the sheet substream starts.
The Worksheet substream specifies a sheet in a workbook.
The cell table is the part of a sheet stream where cells are stored. It contains a series of row blocks, each of which has a capacity of 32 rows of cells, and are filled sequentially. Each row block starts with a series of Row records, followed by the cells that go in the rows, and ends with a DBCell record, which gives the starting offset of the first cell of each row in the block.
A Row record defines a row in a sheet. This is a complex structure, but only the first 6 bytes are needed for basic content retrieval. These give the row index and the columns of the first cells and last cells that contain data or unique formatting in the row.
All of the cells in a row block are stored after the last row in the block. There are seven kinds of records that represent actual cells in a worksheet. Most cell records begin with a 6-byte Cell structure. The first 2 of those bytes specify the row, the next 2 bytes specify the column, and the last 2 bytes specify an XF record in the Globals substream that contains formatting information.
The following records represent the different kinds of cells. Unless specified otherwise, the first 6 bytes are taken up by the cell structure, and the remaining bytes contain the value.
Blank cell record
A Blank cell record specifies a blank cell that has no formula or value. This record type is used only for cells that contain individual formatting; otherwise, blank cells are stored in MulBlank records or not at all.
RK cell record
An RK cell record contains a 32-bit number. Excel automatically converts numbers that can be represented in 32 bits or less to this format for storage as a way to reduce file size. Instead of a 6-byte cell structure, the first 2 bytes specify the row and the second 2 bytes specify the column. The remaining 6 bytes define the number in an RkRec structure for disk and memory optimization.
BoolErr cell record
A BoolErr cell record contains a 2-byte Bes structure that may be either a Boolean value or an error code.
Number cell record
A Number cell record contains a 64-bit floating-point number.
A LabelSst cell record contains a 4-byte integer that specifies a string in the Shared Strings Table (SST). Specifically, the integer corresponds to the array index in the RGB field of the SST.
Formula cell record
A Formula cell record contains both the formula and the resulting data. The value displayed in the cell is defined in a FormulaValue structure in the 8 bytes that follow the cell structure. The next 6 bytes can be ignored, and the rest of the record is a CellParsedFormula structure that contains the formula itself.
A MulBlank record specifies a series of blank cells in a row.
The first 2 bytes give the row, and the next 2 bytes give the column that the series of blanks starts at. Next, a variable length array of cell structures follows to store formatting information, and the last 2 bytes show what column the series of blanks ends on.
A MulRk record is like a MulBlank record, but instead of blank cells, a MulRk record consists of RK data in RkRec structures.
The Shared String Table (SST) contains all of the string values in the workbook. These values are referenced in the worksheet by LabelSst cell records. The first 8 bytes of the SST give the number of references to strings in the workbook and the number of unique string values in the SST. The rest is an array of XLUnicodeRichExtendedString structures that contain the strings themselves as arrays of characters. Bit 16 of this structure specifies whether the characters are 1 byte or 2 bytes each. You can extend both the SST structure and the XLUnicodeRichExtendedString structure by using Continue records if the number or length of strings exceed their bounds.
Extracting Data from Excel Files
All MS-XLS format file content lives in the sheet substreams. Although you could load every sheet substream indiscriminately, you gain more control and efficiency by using the BoundSheet8 records to locate just the sheets you want to read. Parsing of formulas and formatting information is beyond the scope of this article.
