body { width 85% }; A { background: #FFF; color: #44F; } A:hover { color: #20b2aa; } A.anchor { color: #000; } A.anchor:hover { color: #000; } P { text-indent: 2em; } ul li { list-style-type: lower-roman; }

Design of ustr, the micro string API - for C

Non-Magic version of a ustr

/* As the saying goes:
 *   "you can tell a lot about the code from the data structure".
 *
 * So this is the non-magic version of struct Ustr:
 *
 * struct Ustr
 * {
 *   const char *ptr;
 *   size_t reference_count;
 *   size_t length;
 *   size_t size;
 *
 *   unsigned char flag_is_readonly            : 1;
 *   unsigned char flag_is_fixed               : 1;
 *   unsigned char flag_exact_byte_allocations : 1;
 *   unsigned char flag_has_enomem_error       : 1;
 *   unsigned char flag_fail_on_alloc          : 1;
 * };
 *
 * ...however this "eats" memory for "small" strings, which is one reason given
 * why people don't use a real string ADT.
 *
 */

Magic/real version of ustr

struct Ustr 
{
 unsigned char data[1];
 /* 0b_wxzy_nnoo =>
  * 
  *    w         = allocated
  *     x        = has size
  *      y       = round up allocations (off == exact allocations)
  *       z      = memory error
  *         nn   = refn
  *           oo = lenn
  *
  * 0b_0000_0000 = "", const, no alloc fail, no mem err, refn=0, lenn=0 */
};

struct Ustrp
{ /* This is just used as a way to catch errors at compile time with static
   * typing. You might think we'd define Ustrp to be identical to Ustr, and
   * just cast however aliasing rules screw us over if we do that. */
  struct Ustr s;
};

Unused bit patterns for a ustr

/* Unused bit patterns for data[0]:
 *
 * 0bxxxx_x100 == lenn = 0
 * 0bxxxx_1x00
 * 0bxxx1_xx00
 * 0bxx1x_xx00
 *              0bx1xx_xx00 => is used in sized
 * 0b1xxx_xx00
 *
 * 0bx1xx_xx11 == 128bit size_t, yeh!
 * 0bx1xx_11xx
 *
 * 0b0xxx_x1xx == refn values for const/fixed
 * 0b0xxx_1xxx
 *
 * 0b00x1_xxxx == mem error for const
 * 0b001x_xxxx == not exact for const
 *
 */

English description of struct Ustr

/*
 *    ================ English description of struct Ustr ================
 *
 *  This strucure is very magic, as an optimization "" is a valid "structure"
 * for an empty string, ustr_len("") == 0, ustr_size("") == 0,
 * ustr_cstr("") == "".
 *  This also works for other "static data strings" if they start with the first
 * four bits as 0 ... although for C const strings using this, the length has
 * to be manually defined correctly at compile time.
 *
 *  However it is also a "space efficient" String ADT, with a managed length,
 * _either_ with or without reference counts (so that ustr_dup() doesn't copy)
 * and with or without a stored size (so growing/shrinking a lot doesn't
 * cause lots of malloc/realloc usage). Note that for lengths of strings, or
 * reference counts that need >= 8 bytes of storage the Ustr _much_ contain
 * a stored size.
 * 
 *  It can also track memory allocation errors over multiple function calls.
 *
 *  If the first byte == 0, then it's "" as above.
 *  Otherwise the first byte contains four flags, and 2 numbers (2 bits each).
 *
 *  The 1st flag says if the string is allocated. The 2nd flag says if it has a
 * stored size. The 3rd flag says if it isn't using "exact allocations", if it
 * is ustr_len() + ustr_overhead() == ustr_size_alloc(), otherwise there might
 * be room to grow without doing a memory allocation call[1]. The 4th flag says
 * if this Ustr has had a memory allocation error.
 * The two numbers are the mappings for the length of the reference count, and
 * the length of the length. Note that this mapping changes if the 2nd flag is
 * set.
 *
 * [1] There is an implied "size" of the string, based an how big it is. This
 * is a concession for speed efficiency, although it only grows at 1.5x
 * not 2x which is the norm. (again, to reduce memory waste). Again if this is
 * too much overhead, just use exact sized allocations.
 *
 *  Also NOTE if the 1st and 2nd flags are set, this means that the Ustr is in
 * fixed storge (like an auto array). Also if the 1st, 2nd and 3rd flags are
 * set, this means that the Ustr is limited, Ie. it's in fixed storge and
 * _cannot_ grow (all allocation attempts will fail).
 *
 *
 *  Next, possibly, comes the reference count (of the given length[2]).
 *  Next, if not "", comes the length of the data (of the given length[2]).
 *  Next, if non-zero length, comes the data, which can include NIL bytes.
 *  And finally, if not "", a NIL byte, to make sure it's always a valid C-Style
 * string (although obviously any embeded NIL bytes will make it look shorter
 * if something else just uses strlen(ustr_cstr())).
 *
 * [2] The sizes can currently be 1, 2, 4 or 8 bytes. Depending on the mapping
 * of the value in the first byte (length changes dynamically, as you add data).
 *
 *  Examples:
 *
 *  The allocated string "a", using exact allocations and no reference
 * counting, is the 4 bytes:
 *
 *     {0x81, 0x01, 'a', 0x00}
 *
 *
 *  The allocated string "ab", using non-exact allocations and 1 byte reference
 * counting (shared by 13 users), is the 6 bytes (there is no waste due to
 * non-exact allocations):
 *
 *     {0xA5, 0x0D, 0x02, 'a', 'b', 0x00}
 *
 *
 *  The allocated string "abcdefghijklmnop" (16 bytes, and a NIL), with
 * non-exact allocations 2 byte reference counting (shared by 1,003 users), is
 * the 24 bytes (with 3 bytes of "unused" space at the end):
 *
 *     {0xA9, 0x03, 0xEB, 0x10, 'a', 'b', [...], 'o', 'p', 0x00, <x>, <x>, <x>}
 */

Number of options for strings

This design allows 83 different combinations of options for your string types:


James Antill
Last modified: Sat Jun 23 11:30:14 EDT 2007