Wednesday, December 24, 2025

Subtle C++ alignment and size issues

Short version: sizeof in C++ can be misleading and cause problems, particularly inside assignment overloads, and public/private has surprising effects on alignment.

Object alignment requirements can cause some subtle and hard-to-diagnose bugs in C++. The demonstrations here show two issues that have caused problems that I ended up debugging, and I hope they help others find similar issues buried in their code. (Issues seen with gcc-10, gcc-14, clang-14, clang-17, and others. Note that all examples here are distilled down from much more complex code, so many questions about alternative implementations are out of scope.)

First, some background.

Most C and C++ programmers are aware that struct and class members are placed on "natural" alignment boundaries. This placement is dependent on CPU architecture and a number of other factors (including "packed" options supported by some, but not all, compilers and architectures), but the most common rule is that by default each member is placed at an offset that is a multiple of the member's own size. That is, a 'char' is on any byte boundary, a two-byte 'short' is on an even byte boundary, a four-byte 'int' is on a multiple of 4 bytes, and so on. If the offset after the preceding element doesn't provide the right alignment, then the compiler automatically inserts hidden padding to make it right. For example:

struct {
  /* must be at offset 0
     in the struct */
  char first;
  /* offset 2; one byte of
     padding inserted before */
  short second;
};
The overall size of that struct above may be either 3 or 4 (or perhaps even more), depending on the requirements of the platform. This is why it's somewhat common practice to put the large objects first in a structure, followed by smaller ones, or to group small objects together in clusters. Both strategies minimize this wasteful padding. Note that this padding is between successive members in a structure.

In general, these size-based rules apply to fundamental types, and do not directly apply to compounds (nested structures). Instead, a compound has composite size and alignment based on its contents. That is, a struct has a required alignment that is equal to the strictest (largest) requirement of any member inside, and not its overall size.

A subtle corollary of the above rules is that sizeof() on a struct (or class) must return a value that is rounded up based on the alignment of the strictest member inside, effectively producing trailing hidden padding. That's because sizeof() must always give the proper stride of objects within an array. For example:
struct {
  int val1;
  short val2;
};
The members of this struct occupy only 6 bytes (assuming a 4-byte int and 2-byte short), with no padding between them. But the natural alignment of that 'int' is on a 4-byte boundary, so sizeof() will be 8, making sure that adjacent array entries are all on natural alignment boundaries when multiple objects are allocated. Note that this shows trailing padding, after the last member.

The first problem, shown in align-surprise1.cpp (see attachments at the end of this post), is that the C++ compiler will naturally pack together member variables using an alignment that is less strict than the alignment required for the base class. This requires some explanation.

The definition of SimplePOD includes a naive optimization: we know that all of the members are just plain old data, and the class itself is not virtual (thus has no vft pointer to worry about), so why not take advantage of that, and use memset/memcpy rather than individually assigning each object?

The subtle error here is that the size being copied includes that alignment padding at the end, and the compiler is free to insert other, unrelated members inside that padding. It can't place them between members (as best I can tell), but the trailing padding is fair game.

In this case, the data copied by the SimplePOD assignment operator includes an extra 4 bytes (actual length is 20 but sizeof is 24), so the private members in both TestClass and TestClass2 are overwritten by the assignment of the base class. This test just shows the effect of that issue, which is effectively memory corruption. In the case where I originally encountered this problem, the contents of a smart pointer were copied, resulting in a reference count mismatch and a crash.

You can see the problem demonstrated by running "make" and then executing "./surprise1-fail". The fixed version is built as "./surprise1-pass".

The fix (enabled by FIX_BUG) computes the actual size of SimplePOD and uses that for the copy. Note that commenting out the assignment operator overload in SimplePOD also fixes the problem, as the compiler will internally compute the correct amount to copy. C++ itself just provides no convenient means for the user to compute this value, which is important if an overload is needed by the overall class design.

The second problem is shown by align-surprise2.cpp and is even more shocking. In some cases, changing from "public" to "private" or "protected" will cause the member alignment and the overall object size both to change. This is demonstrated by the output of "./surprise2-public" and "./surprise2-private". The only difference is whether the members are public or private.

The surprise2-public output looks like this:
baz 12 foo 8
a 0
b 4
c 8
d 9
This shows that the size of baz is 12, and that the alignment of 'c' starts on the next int boundary. But surprise2-private (and surprise2-protected) show this:
baz 8 foo 8
a 0
b 4
c 5
d 6
The size of baz is down to 8 and the alignment of 'c' has changed to start on a char boundary. Again, the only change is the visibility of the members.

This has several implications. One is that if (say) you are debugging a problem and change some members from private to public just to simplify some temporary debug code, you might also be inadvertently changing the actual offset of those members within the object. If the entire project isn't recompiled, you could have mysterious behavior or even crashes as a result. Another is that seemingly innocuous improvements in C++ code (for example, making public members private and providing accessors instead) could easily change the size of the object and affect cache alignment, drastically altering performance by creating new opportunities for false sharing. Still another is that if you cast pointers back and forth between different classes, you may find that the actual offsets of members in those classes depend on the visibility specified, and thus memory corruption may occur.

It's a jungle out there!

https://www.workingcode.com/align-surprise.tar.gz
https://www.workingcode.com/align-surprise.zip 

Thursday, June 27, 2019

Funny C++ Initialization

If you have a class (call it "myclass") that has Plain Old Data (POD) members and does not have a defined constructor (i.e., uses the compiler-supplied default constructor), then "new myclass;" and "new myclass();" will do different things.  The parenthesis-free version will leave the POD uninitialized, but the version with parentheses will initialize the POD elements to zero.

If you define a constructor in the class, then the difference disappears.  Both forms of "new" result in leaving the POD uninitialized (as you may have originally expected).

Here's a test case that demonstrates the difference:

funny-init.cpp

Compiling (with g++) and running this little program produces:

Default constructor, no parenthesis:  42
Default constructor, parenthesis:     0
Explicit constructor, no parenthesis: 42
Explicit constructor, parenthesis:    42

An additional bit of hilarity is that the default-constructor variant compiles to an actual object constructor that does nothing, but when you invoke "new" with parentheses, the site of the "new" invocation is littered with extra instructions just to write zeros over the POD elements in the class.

One surprising place this difference shows up is with "struct" and placement new.  If you use placement new with a "struct" and you add the parentheses, then the underlying storage is wiped clean.  Here's a test case for that:

placement-wipeout.cpp

And the corresponding output:

Default constructor, no parenthesis:  42
Default constructor, parenthesis:     0

It's hard to see how this is a helpful state of affairs, but forewarned is forearmed.

Tuesday, November 6, 2018

RTLD_DEEPBIND has deep surprises

Linux RTLD_DEEPBIND has at least one very deep problem: if you load a dynamic object with it, the symbols called from within that one object will ignore LD_PRELOAD and go directly to the dependency, but symbols within the dependency itself are still resolved via LD_PRELOAD.

A real-world example of this failure is with a dynamically loaded object that invokes malloc(3), strdup(3), and free(3).  Suppose we have an application using an LD_PRELOAD that interposes on malloc and free.  The calls to malloc, strdup, and free from within the loaded object will go straight to libc, bypassing the preload as expected.  But the implementation of strdup inside libc invokes malloc on its own:

https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strdup.c;hb=HEAD

That invocation of malloc will go to the LD_PRELOAD library, not libc's local definition.  As a result, the pointer that the dynamic object gets back is from the LD_PRELOAD library's implementation of malloc.  If the dynamic object tries to free that pointer, it will go straight to the libc definition of free().  Unless the preload is "just" an innocuous wrapper on the libc functions, and doesn't replace them outright, this will fail in spectacular ways.

Here's a demo of the sort of hilarity this causes:

https://www.workingcode.com/deep-disaster.tar

This program produces the following output with "make test":

LD_PRELOAD=./preload.so ./main
Doing normal test: in preload: Inside the normal library
Doing bound test: in libbound: Inside the normal library
Doing inside test: from inside: in preload: Inside the normal library
Doing bound inside test: in libbound: from inside: in preload: Inside the normal library

The first two test results are as expected.  The main program goes through the preload to get to the common library, and the deeply-bound library does not.  The third result is also fine, and represents the main program invoking a function inside the common library that invokes another library function, which redirects through the preload.  The fourth result is the problem.  The deeply-bound library invokes a function in the common library that in turn invokes another library function.  In this case, it (somewhat surprisingly) goes through the preload, even though the user probably expected that RTLD_DEEPBIND would avoid the use of the preload.

At best, it does so "sometimes."

Note that this means that almost any non-trivial use of RTLD_DEEPBIND is incompatible with (at least) the usual LD_PRELOAD=libtcmalloc.so type of wrapper.  Anything you load with RTLD_DEEPBIND is hopelessly compromised if it invokes libc functions that internally use malloc or free.  Or it means that any such wrapper must carefully wrap all exposed libc interfaces (such as strdup and fopen) that can allocate memory, and supply its own implementation -- a feat that may be impossible.

Friday, January 12, 2018

GNU make reports "process_begin: CreateProcess(NULL, pwd, ...) failed." on Cygwin

I very rarely build anything on Windows.  My day job doesn't call for it much, and I certainly wouldn't want to do that for "fun."  So, when I do, I often run into problems that (a) I don't understand and (b) nobody else seems to have seen before.  This is one of those sorts of problems.

My build on a new machine failed almost immediately after typing "make" with this error:
process_begin: CreateProcess(NULL, pwd, ...) failed.
I have no idea what that means.  The "pwd" command, of course, works just fine for me at the command line, and "whence" tells me it comes from /usr/bin.  Everything looks fine there.

After a lot of debugging and comparisons with others who had working configurations, I found this: near the front of my PATH, I had an entry like this:
/cygdrive/c/this/does/../not/exist
That's not the actual path, of course, but it gives the idea.  The bug I encountered is this: if you have a path that includes "/.." and if the previous directory in that path doesn't physically exist on that machine, then the path search function stops right there.  It doesn't look at the rest of the path entries at all.  So, if you have something like this before "/usr/bin", you're sunk.  A PATH entry like that works fine on all UNIX and Unix-like systems, and it even works fine for the Cygwin shells.  But, for some reason, it doesn't work within GNU make's code that deals with path searches.

Changing that path so that it didn't have "/.." in it fixed the problem.

Sunday, July 2, 2017

Not Hertz

This happened many years ago, but I still think about it at times, and a recent exchange on Twitter made me remember that I really should have written it up a long time ago.  Better late than never, I suppose.

In 2009, my father was being treated for cancer, and his health was up and down.  I had been planning for some time to take a day of vacation from work on Friday, November 6th, 2009, and stay the weekend.  I had a flight booked on Jet Blue and a car reserved with Hertz.

On Monday, November 2nd, I got word that my father had taken a turn for the worse.  The radiation treatment had caused swelling in his throat, and he was rushed to the hospital to insert a trach tube.  After talking to my brother in Pittsburgh, I decided to change my plans.  I would fly out as soon as possible so I could be with my brother, father, and step-mother.

I called Jet Blue first.  No problem.  It was a $50 fee to change the flight, but I could have the first plane out in the morning on Tuesday.  That was pretty easy, and I was encouraged.

Then I called Hertz reservations through the 800 number.  I'd booked with them because I had a good bit of experience with them.  They were one of the preferred providers when I was at Sun, and I traveled a lot for work when I was there.  The customer service folks couldn't change my previous reservation, but they did offer an alternative: they could book a second car from the 3rd to the 6th, so I had a car the whole time, and they told me that the reservations counter in Pittsburgh would be able to help when I got there.  Nothing they could do by phone; it had to be done in person.  That was less than ideal, but what do I know about their reservations system?

The next morning, I flew to Pittsburgh, and got to the reservations counter.  They told me, no, they couldn't change the two contiguous reservations into one.  They had no idea why the 800 people told me that.  They told me I should call the 800 number again and ask for help.

So, I stepped out of line and into the waiting area, and called the 800 number again from my cell phone.  No dice.  They couldn't do anything for me and told me to talk to the people at the desk again.  So, I got back in line and waited again at the desk.  When I got up there, I asked if there was a manager I could talk to.  The answer was just "no."  They helpfully said that I should call the desk on Friday and ask to have it fixed before having to drive in.  They assured me there was no way I'd have to return just to swap cars; that would be silly.

I took the keys and went off to visit my father.  We had to make some really tough (and possibly wrong) decisions over the next few days.  It was a difficult time, but I'm very glad I made the trip.

On Friday, I called the Hertz desk in Pittsburgh.  I was told that, no, they could not help me.  I could not extend the current reservation.  I could not combine reservations.  The only thing I could do would be to return the car as agreed on Friday and then take another one out.  So, that night, I drove the 20 miles / 30 minutes to the airport, dropped off one ugly champagne colored Elantra, picked up a nearly identical champagne colored Elantra -- the only difference was that the XM radio worked in one and not in the other -- then drove back to my father's place.

That Sunday, November 8th, I dropped off the second car and returned to Boston.  That's the last time I've ever done business with Hertz, and the last time I ever will.

This sort of thing, to me, reeks of a corporate culture problem.  None of the customer service representatives had the remit to fix things.  In a company that is serious about customer service, the employees are given the power to "make things right" -- even if this means breaking company policies.  The service I got indicates the reverse.  Nobody I dealt with had the power to fix anything.

The problem is not necessarily being treated like dirt by every single customer service representative I dealt with.  The problem is having no prospect of things ever getting better, because it wasn't just one or two people having a bad day or not knowing how to make changes.  In fact, the representatives were pleasant to deal with, but completely unhelpful.  A systemic problem like that is something I can't put up with.  So when I need ground transport, it's anyone but them.

Thursday, November 24, 2016

"A start job is running for dev-disk-by" and other horrors

My desktop system at home is currently running OpenSUSE Tumbleweed.  It used to run Debian until I got caught in an upgrade version-locked disaster.  Then it ran OpenSolaris until Oracle made that an unlikely proposition.

I've been reasonably happy with OpenSUSE until I made the mistake of trying to do a "zypper dup" recently, and the reboot showed me this:

[***   ] A start job is running for dev-disk-by\x2duuid-1a0dc1c5\x2d26cc\x2d45ff\x2da7b1\x2d1f827c971ff9.device (15s / no limit)

As long as one might care to sit there and watch, it never completed whatever task it was trying to perform.

I was able to boot up with a rescue CD (thank goodness I downloaded that first), and was able to mount the disks with no trouble.  But no amount of fooling around would make it boot.  It was not a happy evening.

After quite a bit of fooling about, I discovered that there were two serious problems that I had to fix manually, and I'm writing this for those who might have run into similar problems:

1. dracut is missing bits

Crucial kernel drivers go missing when dracut builds a new initrd image, and other bits get included whether you like it or not.  I have a mix of file systems in use, and here are the new configuration bits I had to add to /etc/dracut.conf.d:

add_dracutmodules+=" btrfs "
add_drivers+=" btrfs zlib_deflate xor raid6_pq "
add_drivers+=" md-mod raid1 raid456 "
omit_drivers+=" nouveau "

That it would exclude the RAID and btrfs drivers by default was very surprising.

2. udevd is broken by default

The default configuration of udevd simply doesn't work right.  It limits itself to an absurdly tiny number of processes, and ends up failing to run trivial scripts needed by the Linux "MD" disk subsystem.  That's a big part of my boot problem.  The solution is to create a file named /etc/systemd/system/systemd-udevd.service with this inside:

[Unit]
Description=udev Kernel Device Manager
Documentation=man:systemd-udevd.service(8) man:udev(7)
DefaultDependencies=no
Wants=systemd-udevd-control.socket systemd-udevd-kernel.socket
After=systemd-udevd-control.socket systemd-udevd-kernel.socket systemd-sysusers.service
Before=sysinit.target
ConditionPathIsReadWrite=/sys

[Service]
Type=notify
OOMScoreAdjust=-1000
Sockets=systemd-udevd-control.socket systemd-udevd-kernel.socket
Restart=always
RestartSec=0
ExecStart=/usr/lib/systemd/systemd-udevd
MountFlags=slave
KillMode=mixed
WatchdogSec=3min
TasksMax=infinity

The important part is that "TasksMax=infinity" line.  That's what fixes the system so that it will actually boot again.

Saturday, August 29, 2015

Missing Windows users? The badly-named "net" command is your friend.

I have a lousy old laptop that I use for IMC Club presentations and the like.  I wish I could afford something better, but it mostly works almost well enough to keep me from bothering to look around.

Except every once in a while, it falls apart.  Windows is, unfortunately, really terrible in that way.  The latest problem it had is so strange and so obscure that I felt I had to write about it just in case someone else runs into it.

First symptom: all user accounts are gone from the login screen.  In fact, the login screen itself is skipped, and it goes straight to asking for the Admin password on boot.

On getting in, the "user accounts" tool shows nothing but Admin and Guest.  All user accounts are just plain gone.  Attempting to add the user accounts back results in an error message saying that the user "already has permission to access this computer."  Well, that's unhelpful.

The parental controls section shows the accounts.  The files are still there under C:\Users.  Everything seems in place, but nobody can log in.  Regedit shows nothing interesting.  Googling around for all sorts of related phrases shows that quite a few people have experienced this, but nobody has solved it.

I just solved it.  Typing "net user Jim", I can see output that ends like this:

    Local Group Memberships
    Global Group memberships   *None
    The command completed successfully.

I tried adding a dummy account, and it showed up with "Local Group Memberships" set to *Users.  That's the key.  For some reason, all of the accounts had been kicked out of the "Users" group, and that's why they were gone from the login screen.  Adding them back in looks like this:

    net localgroup Users Jim /add
    net localgroup Users Madeline /add

After doing that, the system was back to normal.  Ah, Windows.  Thanks for wasting so many hours of my life.