Automating C++ Mobile SDK Binary Size Analysis with Pandas Within a CI Pipeline


The merge request looked innocent. A minor refactoring of a C++ template metaprogramming utility used across our mobile SDK. Functionally, it was a net positive. Yet, when the final artifacts were built, the Android AAR had swelled by 250KB and the iOS static framework by nearly 300KB. This wasn’t a one-off event; it was a symptom of a chronic disease. Our SDK was suffering death by a thousand cuts of binary size bloat, and our manual “check the file size” step in the release process was clearly inadequate. We needed a system that could not only detect size regressions but also pinpoint the exact symbols responsible, automatically, on every single commit.

Our initial concept was to treat build artifacts as structured data sources, not opaque blobs. This meant our CI pipeline needed to evolve from a simple build runner into an analytical engine. The core idea was to establish a baseline from our main branch and compare every merge request’s generated artifacts against it. A deviation beyond a certain threshold would not just fail the build, but provide a detailed report explaining why.

For a C++ project, a simple file size check is meaningless. The 300KB increase could be from a single bloated function, debug symbols that were accidentally included, or the instantiation of thousands of small template functions. We needed a tool that could dissect the binary. We evaluated nm and objdump, but they required significant parsing work. Google’s bloaty was the perfect fit. It analyzes binaries, providing a detailed, hierarchical view of symbol sizes, and critically, it can output this data in a clean, machine-readable format like CSV.

The CI platform itself was GitLab CI. Its artifact and caching mechanisms are robust, and the ability to define complex multi-stage pipelines in YAML was essential. We decided on a three-stage pipeline: build, analyze, and report. The build stage would compile the C++ code for our target mobile platforms (Android and iOS) and run our performance benchmarks. The analyze stage would be the core of the new system, running bloaty and a custom analysis script. The report stage would then present the findings.

The final piece of the puzzle was the analysis itself. Chaining together awk, sed, and grep in a shell script felt brittle and destined to become unmaintainable. The data from bloaty and our performance benchmarks (which we configured to output JSON) was tabular. This immediately brought Python and the Pandas library to mind. Using Pandas inside a lightweight Docker container in our CI job would give us the power to load, merge, diff, and query our build data with remarkable flexibility, far surpassing what shell scripting could sanely offer.

Phase 1: Generating Actionable Build Artifacts

Before any analysis, we must first produce consistent and data-rich artifacts. The key here is reproducibility. Running a build on a developer’s machine versus a CI runner can yield different results due to toolchain versions or environment variables. The only sane solution is to lock down the build environment in a Docker image.

Our .gitlab-ci.yml starts by defining these build jobs. Notice the explicit creation of build data artifacts (bloaty-report.csv, perf-report.json) alongside the actual binary.

# .gitlab-ci.yml

stages:
  - build
  - analyze
  - report

variables:
  SDK_NAME: "MyMobileSDK"

.build_template: &build_definition
  stage: build
  image: registry.my-company.com/cpp-mobile-builder:1.2.0 # Custom image with NDK, iOS SDK, bloaty, etc.
  script:
    # 1. Configure and build the C++ core
    - cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=$TOOLCHAIN_FILE
    - cmake --build build --config Release -j $(nproc)

    # 2. Run performance benchmarks and output to JSON
    # A common mistake is to not capture this data. Performance is a feature.
    - ./build/bin/sdk_benchmarks --benchmark_format=json > perf-report.json

    # 3. Run bloaty to generate a symbol size report in CSV format
    # We target the static library, which contains all our code symbols.
    - bloaty build/lib/${SDK_NAME}.a -d compileunits,symbols --csv > bloaty-report.csv

  artifacts:
    paths:
      - build/lib/ # The actual library for downstream jobs
      - bloaty-report.csv
      - perf-report.json
    expire_in: 7 days

build:android:
  <<: *build_definition
  variables:
    TOOLCHAIN_FILE: $ANDROID_NDK_HOME/build/cmake/android.toolchain.cmake

build:ios:
  <<: *build_definition
  variables:
    TOOLCHAIN_FILE: $CI_PROJECT_DIR/cmake/ios.toolchain.cmake

The Docker image cpp-mobile-builder:1.2.0 is critical. It contains a specific version of the Android NDK, the iOS toolchain, CMake, and the bloaty executable. This eradicates the “it works on my machine” problem.

A key implementation detail is the bloaty command: bloaty build/lib/${SDK_NAME}.a -d compileunits,symbols --csv > bloaty-report.csv. We aren’t just getting symbol names; we’re asking for the compile units they originate from (the CSV columns follow the order of the -d data sources). This is invaluable for tracing a bloated template function back to the .cpp file that instantiated it. The CSV output is structured for easy parsing:

compileunits,symbols,vmsize,filesize
/build/src/utils/string_helpers.cpp,[...],8192,8192
/build/src/utils/string_helpers.cpp,cool::sdk::StringUtils::Split(std::string const&),4096,4096
/build/src/api/user_profile.cpp,[...],4096,4096
/build/src/api/user_profile.cpp,cool::sdk::UserProfile::Update(cool::sdk::Json const&),2048,2048
...

Similarly, our google/benchmark-based performance suite now generates a perf-report.json, a structured representation of benchmark names and timings. This is far superior to parsing human-readable text output.
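
For reference, google/benchmark emits a single JSON object with a "context" block and a "benchmarks" array; our analysis only relies on the name and cpu_time fields. An abridged, illustrative excerpt (the values are made up) looks like this:

{
  "context": {
    "date": "2023-11-07T10:21:13+00:00",
    "num_cpus": 8,
    "mhz_per_cpu": 2400,
    "library_build_type": "release"
  },
  "benchmarks": [
    {
      "name": "BM_StringSplitting",
      "iterations": 579875,
      "real_time": 1204.58,
      "cpu_time": 1203.91,
      "time_unit": "ns"
    }
  ]
}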

Phase 2: The Pandas Analysis Engine

With reliable data generation in place, the next stage is analysis. We define a new job in our GitLab CI pipeline that runs after the build jobs. This job uses a standard Python Docker image and installs the dependencies it needs.

# .gitlab-ci.yml (continued)

analyze:android_regressions:
  stage: analyze
  image: python:3.9-slim
  needs:
    - job: build:android
      artifacts: true
  before_script:
    - pip install pandas tabulate "python-gitlab>=3.0.0"  # tabulate is required by DataFrame.to_markdown
  script:
    # 1. Fetch the baseline artifacts from the target branch (e.g., main)
    # This is a critical step for comparison.
    - echo "Fetching baseline artifacts from target branch: $CI_MERGE_REQUEST_TARGET_BRANCH_NAME"
    # Use the GitLab API to find the last successful pipeline on the target branch and download its artifacts (a sketch of this helper follows the job definition)
    - python ./scripts/download_artifacts.py --token $GITLAB_API_TOKEN --project $CI_PROJECT_ID --branch $CI_MERGE_REQUEST_TARGET_BRANCH_NAME --job "build:android" --output-dir baseline/
    
    # 2. Run the core analysis script
    - python ./scripts/analyze_build.py --baseline-bloaty baseline/bloaty-report.csv --current-bloaty bloaty-report.csv --baseline-perf baseline/perf-report.json --current-perf perf-report.json --size-threshold 1024 --perf-threshold 0.05
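
The download_artifacts.py helper referenced in the job above is not part of the analysis script itself. Below is a minimal sketch of how it could be implemented with python-gitlab: find the newest successful pipeline on the target branch, locate the named build job, and unpack its artifact archive into the output directory. The --gitlab-url flag (defaulting to GitLab's predefined CI_SERVER_URL variable) and the error handling are assumptions of this sketch, not requirements of the pipeline.

# scripts/download_artifacts.py (sketch)

import argparse
import io
import os
import zipfile

import gitlab

def main():
    parser = argparse.ArgumentParser(description="Download artifacts from the latest successful baseline build.")
    parser.add_argument('--token', required=True)
    parser.add_argument('--project', required=True)
    parser.add_argument('--branch', required=True)
    parser.add_argument('--job', required=True)
    parser.add_argument('--output-dir', required=True)
    parser.add_argument('--gitlab-url', default=os.environ.get('CI_SERVER_URL', 'https://gitlab.com'))
    args = parser.parse_args()

    gl = gitlab.Gitlab(args.gitlab_url, private_token=args.token)
    project = gl.projects.get(args.project)

    # The newest successful pipeline on the target branch is our baseline.
    pipelines = project.pipelines.list(ref=args.branch, status='success',
                                       order_by='id', sort='desc', per_page=1)
    if not pipelines:
        raise SystemExit(f"No successful pipeline found on branch {args.branch}")

    # Locate the build job inside that pipeline and download its artifact archive.
    jobs = pipelines[0].jobs.list(all=True)
    matching = [j for j in jobs if j.name == args.job]
    if not matching:
        raise SystemExit(f"Job {args.job} not found in pipeline {pipelines[0].id}")

    job = project.jobs.get(matching[0].id, lazy=True)
    archive = job.artifacts()  # the whole artifacts archive (zip) as bytes
    zipfile.ZipFile(io.BytesIO(archive)).extractall(args.output_dir)

if __name__ == '__main__':
    main()

Keying the lookup on the job name rather than a pipeline ID keeps the helper reusable for an equivalent iOS analysis job.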

The real work happens in analyze_build.py. This script is the heart of the system. It ingests the four data files (baseline and current reports for size and performance) and performs a differential analysis.

Here is a production-grade version of that script. It includes argument parsing, error handling, logging, and the core Pandas logic.

# scripts/analyze_build.py

import pandas as pd
import argparse
import json
import sys
import logging
import re

# Setup basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def demangle_cpp_symbol(symbol):
    """
    A simplified symbol normalizer. bloaty already emits demangled C++ names,
    but compilers append per-build suffixes (e.g. ".isra.0", ".constprop.2",
    ".part.1", ".cold") that can make otherwise identical symbols look added
    or removed between runs, so we strip them before diffing.
    A true demangler (e.g. abi::__cxa_demangle via a small helper, or piping
    through c++filt) would be more robust if your reports contain mangled names.
    """
    return re.sub(r'\.(isra|constprop|part|cold)(\.\d+)?$', '', str(symbol))

def load_bloaty_report(filepath):
    """Loads a bloaty CSV report into a Pandas DataFrame."""
    try:
        df = pd.read_csv(filepath)
        # The pitfall here is that symbols can be duplicated if they appear in multiple compile units.
        # We group by symbol and sum their sizes to get a total size for each unique symbol.
        df['normalized_symbol'] = df['symbols'].apply(demangle_cpp_symbol)
        return df.groupby('normalized_symbol')[['vmsize', 'filesize']].sum().reset_index()
    except FileNotFoundError:
        logging.error(f"Bloaty report not found at: {filepath}")
        return None
    except Exception as e:
        logging.error(f"Failed to process bloaty report {filepath}: {e}")
        return None

def load_perf_report(filepath):
    """Loads a google-benchmark JSON report into a Pandas DataFrame."""
    try:
        with open(filepath) as f:
            report = json.load(f)
        # google/benchmark wraps its results in a top-level object with
        # "context" and "benchmarks" keys; we only need the benchmark entries.
        return pd.json_normalize(report['benchmarks'])
    except FileNotFoundError:
        logging.error(f"Performance report not found at: {filepath}")
        return None
    except Exception as e:
        logging.error(f"Failed to process performance report {filepath}: {e}")
        return None

def main():
    parser = argparse.ArgumentParser(description="Analyze C++ build for size and performance regressions.")
    parser.add_argument('--baseline-bloaty', required=True, help="Path to baseline bloaty CSV report.")
    parser.add_argument('--current-bloaty', required=True, help="Path to current bloaty CSV report.")
    parser.add_argument('--baseline-perf', required=True, help="Path to baseline performance JSON report.")
    parser.add_argument('--current-perf', required=True, help="Path to current performance JSON report.")
    parser.add_argument('--size-threshold', type=int, default=512, help="Size regression threshold in bytes.")
    parser.add_argument('--perf-threshold', type=float, default=0.05, help="Performance regression threshold (e.g., 0.05 for 5%).")
    args = parser.parse_args()

    # --- Size Analysis ---
    logging.info("Starting binary size analysis...")
    base_size_df = load_bloaty_report(args.baseline_bloaty)
    curr_size_df = load_bloaty_report(args.current_bloaty)

    # Start as an empty DataFrame so the .empty checks below work even if a report is missing.
    size_regressions = pd.DataFrame()
    if base_size_df is not None and curr_size_df is not None:
        # A common mistake is to use an inner merge, which would miss new or removed symbols.
        # An outer merge correctly captures additions, deletions, and changes.
        size_comp = pd.merge(
            base_size_df, curr_size_df, on='normalized_symbol',
            suffixes=('_base', '_curr'), how='outer'
        ).fillna(0)
        
        size_comp['delta'] = size_comp['vmsize_curr'] - size_comp['vmsize_base']
        
        # Find symbols that have grown significantly
        regressions_df = size_comp[size_comp['delta'] > args.size_threshold]
        if not regressions_df.empty:
            logging.warning(f"Found {len(regressions_df)} symbol(s) with size regressions.")
            size_regressions = regressions_df[['normalized_symbol', 'vmsize_base', 'vmsize_curr', 'delta']].sort_values(by='delta', ascending=False)

    # --- Performance Analysis ---
    logging.info("Starting performance analysis...")
    base_perf_df = load_perf_report(args.baseline_perf)
    curr_perf_df = load_perf_report(args.current_perf)

    perf_regressions = pd.DataFrame()
    if base_perf_df is not None and curr_perf_df is not None:
        # We only care about cpu_time for this analysis.
        perf_comp = pd.merge(
            base_perf_df[['name', 'cpu_time']],
            curr_perf_df[['name', 'cpu_time']],
            on='name', suffixes=('_base', '_curr'), how='inner'
        )

        perf_comp['delta_perc'] = (perf_comp['cpu_time_curr'] - perf_comp['cpu_time_base']) / perf_comp['cpu_time_base']
        
        regressions_df = perf_comp[perf_comp['delta_perc'] > args.perf_threshold]
        if not regressions_df.empty:
            logging.warning(f"Found {len(regressions_df)} benchmark(s) with performance regressions.")
            perf_regressions = regressions_df[['name', 'cpu_time_base', 'cpu_time_curr', 'delta_perc']].sort_values(by='delta_perc', ascending=False)

    # --- Reporting ---
    has_regressions = not size_regressions.empty or not perf_regressions.empty
    
    if not has_regressions:
        logging.info("Analysis complete. No significant regressions found.")
        sys.exit(0)

    print("\n--- BUILD REGRESSION REPORT ---")
    if not size_regressions.empty:
        print("\n## 🚨 Binary Size Regressions (Bytes)")
        print(size_regressions.to_markdown(index=False))

    if not perf_regressions.empty:
        print("\n## ⏱️ Performance Regressions")
        perf_regressions['delta_perc'] = (perf_regressions['delta_perc'] * 100).map('{:.2f}%'.format)
        print(perf_regressions.to_markdown(index=False))
    
    logging.error("Significant regressions detected. Failing the pipeline.")
    sys.exit(1)


if __name__ == '__main__':
    main()

The logic of merging the baseline and current dataframes is the core of this process. For size analysis, pd.merge(..., how='outer') is crucial. It ensures we catch symbols that were newly added (where vmsize_base will be NaN, which we fillna(0)) and symbols that were removed. We then calculate a delta and filter for any growth exceeding our --size-threshold. For performance, by contrast, an inner merge is the right choice, since we can only compare benchmarks that exist in both runs.
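
To make that concrete, here is a tiny, self-contained illustration (with made-up symbols) of how growth, removals, and additions all fall out of the same outer merge:

import pandas as pd

base = pd.DataFrame({'normalized_symbol': ['A::f()', 'B::g()'], 'vmsize': [4096, 2048]})
curr = pd.DataFrame({'normalized_symbol': ['A::f()', 'C::h()'], 'vmsize': [8192, 1024]})

diff = pd.merge(base, curr, on='normalized_symbol',
                suffixes=('_base', '_curr'), how='outer').fillna(0)
diff['delta'] = diff['vmsize_curr'] - diff['vmsize_base']
print(diff)
# A::f() grew (+4096), B::g() was removed (-2048), and C::h() is new (+1024).
# fillna(0) turns the NaNs produced by the outer merge into zeroes so every delta is well-defined.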

Phase 3: Integration and Human-Readable Reporting

The script exiting with code 1 is enough to fail the CI job, but a red pipeline alone isn’t actionable. Developers need a clear, concise report delivered directly to their merge request. The Python script prints Markdown-formatted tables to standard output, which are easy to read in the GitLab CI job log.

--- BUILD REGRESSION REPORT ---

## 🚨 Binary Size Regressions (Bytes)
| normalized_symbol                                                          |   vmsize_base |   vmsize_curr |   delta |
|:---------------------------------------------------------------------------|--------------:|--------------:|--------:|
| std::vector<cool::sdk::SpecialObject, std::allocator<...>>::emplace_back... |         12288 |         45056 |   32768 |
| cool::sdk::JsonParser::parse(char const*)                                  |          4096 |          9216 |    5120 |

## ⏱️ Performance Regressions
| name                                  |   cpu_time_base |   cpu_time_curr | delta_perc   |
|:--------------------------------------|----------------:|----------------:|:-------------|
| BM_StringSplitting/_real_time         |        1204.58  |        1513.22  | 25.62%       |

This output is transformative. Instead of a vague “binary size increased,” the developer sees the exact C++ symbol—std::vector<...>::emplace_back—that grew by 32KB. This immediately points them to a code change involving that specific template instantiation.

To take this a step further in a real-world project, the report stage of the pipeline would use a library like python-gitlab to post this Markdown table as a comment directly on the merge request, ensuring the feedback is impossible to ignore. This closes the loop, turning raw build data into actionable developer feedback.

# A snippet for posting a GitLab MR comment (part of a larger reporting script)
import gitlab
import os

# markdown_report_string holds the Markdown tables produced by analyze_build.py
gl = gitlab.Gitlab(os.environ['CI_SERVER_URL'], private_token=os.environ['GITLAB_API_TOKEN'])
project = gl.projects.get(os.environ['CI_PROJECT_ID'])
mr = project.mergerequests.get(os.environ['CI_MERGE_REQUEST_IID'])
mr.notes.create({'body': markdown_report_string})

This system provided the automated oversight we were missing. It stopped size bloat in its tracks and even caught an algorithmic performance regression before it ever reached our users. The combination of a disciplined build process, targeted data extraction with bloaty, and flexible analysis with Pandas gave us a level of insight into our C++ codebase’s health that was previously unimaginable.

The current implementation still has limitations. The baseline is always the latest successful build on main, which can be problematic if main itself has a regression. A more robust system would store historical data, allowing for trend analysis and more intelligent baseline selection. Furthermore, performance benchmarks can be “flaky” due to runner variance. The next iteration of this system should incorporate multiple runs and statistical significance tests to reduce false positives, ensuring that when the pipeline fails, it’s for a reason that genuinely requires a developer’s attention. The demangling logic is also simplistic; a production system would benefit from a proper demangler to handle more complex symbol names from different compilers.

