D3474R1
std::arguments

Draft Proposal,

This version:
https://isocpp.org/files/papers/P3474R0.html
Author:
Audience:
SG17, SG18, SG16
Project:
ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
Source:
https://github.com/jeremy-rifkin/proposals/blob/main/cpp/arguments.bs

Abstract

This paper proposes an encoding-friendly and modern interface for accessing command line arguments throughout a program.

1. Credits

std::arguments was initially proposed by Izzy Muerte in [P1275]. Corentin Jabot and Aaron Ballman also proposed an interface for accessing command line arguments outside main to WG14 in [N2948]. This paper borrows wording, design elements, and good ideas from both.

2. Revision History

2.1. R1

Incorporated feedback from LEWGI regarding interface, wording, bags of bytes, bags of code points, etc.

3. Introduction

This paper aims to solve two main problems:

  1. Encoding and portability problems with command line arguments

  2. int argc, char** argv isn’t a modern-C++ way of representing information like this and is not friendly for novices

Encoding: The only standard means of accessing command-line arguments in C++ is via int main(int argc, char** argv). This is a staple of C and C++; however, it’s not well-suited for portable applications because the encoding of argv varies from system to system [[What is the encoding of argv?]]. On Windows, the native encoding is UTF-16 and it’s recommended to use wmain or WinMain instead of main for portable code. In order to facilitate the standard char** argv, UTF-16 arguments must be converted using legacy Windows code pages. Even on Unix-based systems the encoding of char** argv is not always clear.

Modernity: Passing arrays via a pointer and length argument is a very antiquated pattern rendered obsolete by modern solutions such as std::span. main is one of the last places in C++ where separate pointer and length arguments are still needed. A modern signature for main along the lines of int main(std::span<char*> argv), int main(std::span<std::string_view> argv), or int main(std::argument_list argv) was previously rejected by the committee due to concerns surrounding complexity, overhead, and encoding issues [P0781]. An important benefit of a modern interface is teachability: Currently, command line arguments require an introduction to pointers relatively early in education and expose learners to footguns such as argv[1] == "--help" and to confusion about the difference between C strings and C++ strings. This adds pitch to an already hazardously steep learning curve.
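The argv[1] == "--help" footgun mentioned above can be illustrated with a minimal sketch. The helper names same_address, same_text, and demo_checks are hypothetical and exist only for this illustration; the local array stands in for an argv entry.

```cpp
#include <cassert>
#include <string_view>

// Pointer comparison: true only if both pointers hold the same address.
// This is what `argv[1] == "--help"` actually tests.
bool same_address(const char* lhs, const char* rhs) {
    return lhs == rhs;
}

// Content comparison: wraps the C string in a view, then compares characters.
bool same_text(const char* lhs, std::string_view rhs) {
    return lhs != nullptr && std::string_view(lhs) == rhs;
}

void demo_checks() {
    char typed[] = "--help";  // stands in for an argv entry

    // The array and the literal are distinct objects, so the addresses differ
    // even though the characters are identical.
    assert(!same_address(typed, "--help"));

    // Comparing through a string_view gives the answer a novice expects.
    assert(same_text(typed, "--help"));
    assert(!same_text(typed, "--version"));
}
```

An interface that hands out string-like objects rather than raw pointers makes the second form the default one.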

Tackling both of these problems requires a solution independent of main, as changes to main’s signature have previously been rejected. As such, this paper proposes a std::arguments() function which provides an object that can be used for accessing command line arguments.

While it’s not a primary goal of this paper, this interface results in providing some additional helpful functionality:

Access outside main: In some cases it may be desirable to access command line arguments outside of main and even to do so before main. Some examples could include:

Currently, command line arguments are only available inside of main, which requires a programmer to manually pass this information throughout the program or create their own global storage for arguments. This can add clutter and introduce unnecessary complexity, especially if argument handling doesn’t happen "close" to main. There is precedent from other languages for global access: Python, Go, Rust, Swift, Ruby, C#, Haskell, Ada, and many others provide an interface for accessing arguments from anywhere in a program. Additionally, many C++ frameworks make arguments available outside main, such as Qt with QCoreApplication::arguments.

4. Previous Straw Polls and Discussion

Early polling surrounding an alternative to argc/argv and a means of accessing arguments outside of main occurred during discussion of [P0781]:

POLL: A trivial library solution for iterating parameters?
SF F N A SA
2 12 14 2 1
POLL: A non-main-based way of fetching command line arguments?
SF F N A SA
7 9 9 1 2

Polls on [P1275] by LEWGI:

POLL: We should promise more committee time to the std::arguments part.
Unanimous consent
Attendance: 11

POLL: std::arguments should be available before main
SF F N A SA
6 0 3 1 0
Attendance: 11

Polls on [P1275] by SG16:

POLL: std::environments and std::arguments should follow the precedent set by std::filesystem::path.
SF F N A SA
4 6 1 0 2
Attendance: 14
POLL: std::environment and std::arguments should return a bag-o-bytes and conversion is up to the user.
SF F N A SA
3 4 2 1 2
Attendance: 14

Key concerns discussed included mutability of arguments, overhead of initializing data structures before main, and how to handle different encodings.

LEWGI discussion on P3474R0:

5. Implementability

On Windows, command line arguments can be accessed via GetCommandLineW. This function returns the command line as a string which must then be tokenized. This is called by the Windows CRT during startup to populate argv for main. The Windows CRT also provides __argv and __wargv global variables but only populates one depending on __UNICODE__. Additionally, neither may be populated if the command line parsing is disabled via options tailored to applications trying to minimize startup time.
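The tokenization step described above can be sketched as follows. This is a deliberately simplified illustration, not the Windows CRT algorithm: it handles only whitespace separation and double quotes, and it ignores the backslash-escaping rules the real parser applies. The function name split_command_line is hypothetical, and on Windows the input would be the string obtained from GetCommandLineW.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Splits a command line on spaces and tabs, treating double-quoted runs as
// part of a single argument. Simplified: no backslash escaping.
std::vector<std::wstring> split_command_line(std::wstring_view line) {
    std::vector<std::wstring> args;
    std::wstring current;
    bool in_quotes = false;
    bool have_token = false;  // distinguishes "" (an empty argument) from no argument

    for (wchar_t ch : line) {
        if (ch == L'"') {
            in_quotes = !in_quotes;
            have_token = true;
        } else if (!in_quotes && (ch == L' ' || ch == L'\t')) {
            if (have_token) {
                args.push_back(current);
                current.clear();
                have_token = false;
            }
        } else {
            current.push_back(ch);
            have_token = true;
        }
    }
    if (have_token) {
        args.push_back(current);
    }
    return args;
}

void demo_checks() {
    const auto args = split_command_line(L"prog.exe \"hello world\" --x");
    assert(args.size() == 3);
    assert(args[0] == L"prog.exe");
    assert(args[1] == L"hello world");
    assert(args[2] == L"--x");
}
```

Even this reduced version has to walk the whole string before the argument count is known, which is why eager tokenization is hard to avoid for an interface that exposes a size.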

On macOS, _NSGetArgv and _NSGetArgc can be used to access argc and argv outside of main. These are both trivial functions that don’t allocate.

Implementation on other Unix-based systems is more challenging. There are five options:

  1. Modify libc to store argv and argc globally, e.g. __argc and __argv, similar to __environ. (reference implementation for this from N2948).

  2. Alternatively, store argc and argv from the program’s entry point. This would only require compiler support instead of a libc change.

  3. Use __dl_argv which exists in glibc. Unfortunately, absent a glibc change, looping through __dl_argv would be needed to determine argc as __dl_argc is hidden.

  4. Read from and tokenize /proc/self/cmdline. (this has length limitations)

  5. Use argc and argv passed by glibc to entries in the .init_array.

Approaches 2-4 are undesirable for various reasons. Approach 5 works on glibc but not necessarily other libcs and there are also implications with shared libraries. Approach 1 is most comprehensive but requires a paper.
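Approach 5 can be sketched as below. This is an illustration under stated assumptions rather than a proposed implementation: it relies on the glibc behavior of passing argc, argv, and envp to .init_array entries and on the GCC/Clang constructor attribute, and it is compiled out elsewhere. The names demo::saved_argc, demo::saved_argv, and capture_arguments are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>  // any libc header makes __GLIBC__ visible when building against glibc

namespace demo {
// Constant-initialized, so these are valid before any constructor function runs.
int    saved_argc = 0;
char** saved_argv = nullptr;
}  // namespace demo

#if defined(__GLIBC__) && (defined(__GNUC__) || defined(__clang__))
// The constructor attribute places this function in .init_array. glibc calls
// such entries with the same (argc, argv, envp) that main later receives.
__attribute__((constructor))
static void capture_arguments(int argc, char** argv, char** /*envp*/) {
    demo::saved_argc = argc;
    demo::saved_argv = argv;
}
#endif

void demo_checks() {
#if defined(__GLIBC__) && (defined(__GNUC__) || defined(__clang__))
    assert(demo::saved_argv != nullptr);
    // argv is terminated by a null pointer at index argc.
    assert(demo::saved_argv[demo::saved_argc] == nullptr);
#endif
}
```

Because only pointers are saved, later modifications to argv in main remain visible through demo::saved_argv, and the sketch does nothing on libcs that do not pass arguments to initializers.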

6. Proposed Design

This paper proposes a function std::arguments, a class std::argument, and a header, <arguments>.

std::arguments() returns a span of const std::argument objects corresponding to the program command line arguments.

std::argument mirrors the design of std::filesystem::path by providing observers that can convert to desired encodings. SG16 previously indicated a desire to follow the precedent of std::filesystem::path. Both paths and arguments can be encoded arbitrarily or even have no encoding; paths could be any sequence of bytes and command line arguments can be too. std::argument may be a view of a string or may own an allocation.

While it is not uncommon practice to modify the contents of char** argv, std::arguments returns a read-only span in order to not introduce dangers surrounding global mutable state. Whether changes made to argv in main are reflected in std::arguments is implementation-defined.

6.1. Design Considerations

The main design considerations come down to allocation, when potential tokenization or other argument preprocessing happens, and whether modifications to argv in main are reflected in std::arguments.

Reflecting argv modifications from main: It is desirable for std::arguments to contain the same values throughout the lifetime of a program and to not reflect changes to argv in main. Unfortunately, this would require allocation and copying on some systems. On Unix-based systems all means to access argv will reflect changes to argv in main, including /proc/self/cmdline. Discussion on [P1275] and [P0781] made clear that any overhead before main in the case of programs that don’t use std::arguments is unacceptable. Unfortunately, an initializer similar to std::ios_base::Init isn’t an option due to shared libraries not necessarily being loaded before main. Additionally, with import std; this would translate to overhead before main that is not pay for what you use. Due to implementation challenges, this paper leaves behavior implementation-defined in the case of argv being modified in main.

Saving strlen: On Unix-based systems, producing string views for arguments will involve a strlen. It may be desirable to save the result of this computation, however, the issue of modification mostly rules this out. While the storage for the arguments from the system will always be there, the pointers in argv could be modified and detecting this would be sufficiently complicated, involve overhead, or in general may be impossible. Because of this, every access of an argument string view will require a strlen unless the implementation makes copies of argv string entries. It would likely be undesirable to make it undefined behavior to use std::arguments after modifications in main so this paper leaves the possibility of a strlen cost open.
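The staleness problem described above can be shown with a small sketch. The function argument_view is hypothetical; the local arrays stand in for the argument storage provided by the system.

```cpp
#include <cassert>
#include <string_view>

// Builds a fresh view on every call. Constructing a string_view from a
// const char* measures the string (a strlen-style scan) each time.
std::string_view argument_view(char** argv, int index) {
    return std::string_view(argv[index]);
}

void demo_checks() {
    char program[] = "prog";
    char option[] = "--verbose";
    char* argv[] = {program, option, nullptr};

    assert(argument_view(argv, 1).size() == 9);

    // Simulate main truncating the argument in place.
    option[3] = '\0';
    assert(argument_view(argv, 1) == "--v");  // a cached length of 9 would now be stale

    // Simulate main repointing the argv entry.
    argv[1] = program;
    assert(argument_view(argv, 1) == "prog");
}
```

Re-measuring on each access keeps the view consistent with whatever argv currently holds, at the cost of one scan per access.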

Preprocessing: On Windows, GetCommandLineW returns a string which needs to be split into individual arguments. It may be desirable in some use cases to split this string lazily, with an input-iterator interface for arguments. This paper does not suggest any design constrained to input iteration, though, as most uses will want more general access and iteration abilities and will require tokenizing all arguments anyway, whether by looping through all the arguments or even just by looking at the argument count.

Backing storage for std::arguments: On Unix-based systems it would be simple for std::arguments to not involve any allocation and simply provide iterators over argv that dereference to ephemeral std::argument objects. Unfortunately, this would prevent the iterator from satisfying the Cpp17RandomAccessIterator requirements, container requirements, and may be error prone in the case of trying to store a reference to a std::argument. The proposed requirements here will require backing storage.
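A minimal sketch of what backing storage buys is shown below. The class demo::argument is a hypothetical stand-in for the proposed std::argument that always owns a copy, and a std::vector is used in place of the proposed span<const argument> so that the example stays within C++17; neither choice is mandated by this paper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

namespace demo {

// Hypothetical stand-in for std::argument: owns a copy of one argv entry.
class argument {
public:
    explicit argument(const char* raw) : value_(raw) {}
    std::string_view native() const noexcept { return value_; }

private:
    std::string value_;
};

// Copies argv into contiguous storage so that elements are real objects with
// stable addresses and constant-time indexing.
std::vector<argument> make_backing_storage(int argc, char** argv) {
    std::vector<argument> storage;
    storage.reserve(static_cast<std::size_t>(argc));
    for (int i = 0; i < argc; ++i) {
        storage.emplace_back(argv[i]);
    }
    return storage;
}

}  // namespace demo

void demo_checks() {
    char program[] = "prog";
    char option[] = "--help";
    char* argv[] = {program, option, nullptr};

    const std::vector<demo::argument> storage = demo::make_backing_storage(2, argv);
    const demo::argument& kept = storage[1];  // valid for as long as storage lives

    option[2] = 'X';  // a later modification of argv by main
    assert(kept.native() == "--help");  // the copy is unaffected
    assert(storage.size() == 2);
}
```

Owning copies is only one possible strategy: it costs an allocation per argument but removes the dependence on argv staying unmodified, whereas storing views would avoid the copies and inherit the modification caveats discussed above.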

Global singleton, a function returning a span, or construction: TODO std::arguments could be implemented as a global singleton similar to std::cout, a std::arguments function returning a reference to a singleton, or as an object that the user constructs. While an object the user constructs potentially results in allocation at multiple points in a program, as well as possibly seeing different values if argv is modified in main, it’s also desirable to allow the std::arguments allocation to be cleaned up. As such, this paper proposes a std::arguments class which may perform allocation and various preprocessing at construction.

Globs and argv[0]: On Unix-based systems glob expansion is done by the shell. On Windows it is done by neither the shell nor the Windows CRT. This paper proposes std::arguments should correspond directly to argv in main without any additional glob expansion. This paper also does not propose any special handling for the first entry of argv.

Comparison with other performance-oriented languages: Rust’s std::env::args() function creates an Args object which involves creating a vector of strings in the OS native encoding, copying from argv on Unix-based systems and tokenizing on Windows. Rust accesses argv and argc on most Unix-based systems by placing an initializer in the .init_array. Rust doesn’t have to worry about modification of argv in main.

Because the design of this library feature involves a lot of tradeoffs, it is the goal of this paper to offer as much implementation flexibility as possible.

7. Ergonomics

While most large applications should probably use a library for argument parsing, it is my hope that in the case of more ad-hoc argument parsing it would be possible to portably write a check such as std::arguments().at(1) == "--help" or std::arguments().at(1).native() == "--help". Another helpful operation would be .starts_with("--"). Unfortunately, encoding makes these operations challenging to do portably.

Because encoding will vary between systems and native() is implementation-defined, currently the only way to do this would involve the overhead of creating a string for a given encoding or an ugly macro to create a platform-dependent string literal:

// The overhead here is unfortunate but OK for 99% of uses
if(std::arguments().at(1).string() == "--help") {
  // ...
}

or

#ifdef _WIN32
#define ARG(str) L##str
#else
#define ARG(str) str
#endif
if(std::arguments().at(1).native() == ARG("--help")) {
  // ...
}

A UDL could also be considered, however, this is a more general problem that, in the author’s opinion, should be addressed directly rather than through a bespoke solution. The problem of operations between strings of different encodings would best be tackled in another paper.

Alternatively, since this paper is targeting C++29, maybe transcoding [P2728] will solve all problems:

if((std::arguments().at(1) | std::uc::to_utf8) == u8"--help") {
  // ...
}

If transcoding is seen as desirable here, std::argument should provide some helper to do the system encoding to UTF-N conversion.

8. Bikeshedding

This paper uses the std::arguments naming from [P1275], however, the name is subject to bikeshedding. One point brought up on the mailing list was that arguments is a very generic name and it might be desirable to reserve it for future use. Some names that could be considered instead include:

Naming in other notable languages:

In a very informal approval-voting-style poll on the Together C & C++ Discord server (participants were asked to vote for all they found appealing) members showed a strong preference for either std::arguments or std::argv with eight and 17 votes respectively. Other options had no more than two votes. N.b.: The last option, std::process::arguments, came up after the poll was started and thus wasn’t captured in the poll.

9. Reference Implementation

TODO: Update for R1

A reference implementation / proof of concept is at https://github.com/jeremy-rifkin/arguments.

10. Proposed Wording

Wording is relative to [N4950] and borrows extensively from existing wording.

Insert into [headers] table 24:

<arguments>

Insert into [version.syn]:

#define __cpp_lib_arguments                         20????L // freestanding, also in <arguments>

Insert a new section [arguments]:

Header <arguments> synopsis [arguments.syn]

namespace std {
  class argument;
  span<const argument> arguments();
  template<class Allocator = allocator<argument>>
    span<const argument> arguments(const Allocator&);
}

Function arguments [arguments.arguments]

The function and function template arguments return read-only spans of argument objects corresponding to arguments passed to the program.

namespace std {
  span<const argument> arguments();
  template<class Allocator = allocator<argument>>
    span<const argument> arguments(const Allocator&);
}

span<const argument> arguments();

Effects: Returns a span of const argument objects representing the program’s arguments.

Throws: May throw if allocation throws.

template<class Allocator = allocator<argument>>
  span<const argument> arguments(const Allocator&);

Effects: Returns a span of const argument objects representing the program’s arguments.

Throws: May throw if Allocator::allocate throws.

Class argument [arguments.argument]

An object of class argument is a view of a character string argument passed to the program in an operating system-dependent format.

It is implementation-defined whether, in a main function with signature main(int argc, char** argv), any modifications to argv are reflected by an argument.

namespace std {
  class argument {
  public:
    using value_type  = /* see below */;
    using string_type = basic_string<value_type>;
    using string_view_type = basic_string_view<value_type>;

    // [arguments.argument.native], native observers
    const string_view_type native() const noexcept;
    const string_type      native_string() const;
    const value_type*      c_str() const noexcept;
    explicit operator string_type() const;
    explicit operator string_view_type() const noexcept;

    // [arguments.argument.obs], converting observers
    template<class EcharT, class traits = char_traits<EcharT>,
              class Allocator = allocator<EcharT>>
      basic_string<EcharT, traits, Allocator>
        string(const Allocator& a = Allocator()) const;
    std::string    string() const;
    std::wstring   wstring() const;
    std::u8string  u8string() const;
    std::u16string u16string() const;
    std::u32string u32string() const;

    filesystem::path path() const;

    // [arguments.argument.compare], comparison
    friend bool operator==(const argument& lhs, const argument& rhs) noexcept;
    friend strong_ordering operator<=>(const argument& lhs, const argument& rhs) noexcept;

    // [arguments.argument.ins], inserter
    template<class charT, class traits>
      friend basic_ostream<charT, traits>&
        operator<<(basic_ostream<charT, traits>& os, const argument& a);
  };

  // [arguments.argument.fmt], formatter
  template<typename charT>
    struct formatter<argument, charT>
      : formatter<argument::string_view_type, charT> {
        template<class FormatContext>
          typename FormatContext::iterator
            format(const argument& argument, FormatContext& ctx) const;
    };
}

Conversion [arguments.argument.cvt]

The native encoding of an ordinary character string is the operating system dependent current encoding for arguments. The native encoding for wide character strings is the implementation-defined execution wide-character set encoding ([character.seq]).

For member functions returning strings, value type and encoding conversion is performed if the value type of the argument or return value differs from argument::value_type. For the return value, the method of conversion and the encoding to be converted to is determined by its value type:

If the encoding being converted to has no representation for source characters, the resulting converted characters, if any, are unspecified.

Native Observers [arguments.argument.native]

The string returned by all native observers is in the native default argument encoding ([arguments.argument.cvt]).

const string_view_type native() const noexcept;

Returns: A string_view_type representing the argument.

const string_type native_string() const;

Returns: A string_type representing the argument.

const value_type* c_str() const noexcept;

Returns: A pointer to a null-terminated array of value_type representing the argument.

explicit operator string_type() const;

Returns: A string_type representing the argument.

explicit operator string_view_type() const noexcept;

Returns: A string_view_type representing the argument.

Converting Observers [arguments.argument.obs]

template<class EcharT, class traits = char_traits<EcharT>,
          class Allocator = allocator<EcharT>>
  basic_string<EcharT, traits, Allocator>
    string(const Allocator& a = Allocator()) const;

Returns: A string representing the argument.

Remarks: All memory allocation, including for the return value, shall be performed by a. Conversion, if any, is specified by [arguments.argument.cvt].

std::string string() const;
std::wstring wstring() const;
std::u8string u8string() const;
std::u16string u16string() const;
std::u32string u32string() const;

Returns: A string representing the argument.

Remarks: Conversion, if any, is specified by [arguments.argument.cvt].

filesystem::path path() const;

Returns: A filesystem::path corresponding to the argument.

Comparison [arguments.argument.compare]

friend bool operator==(const argument& lhs, const argument& rhs) noexcept;

Effects: Equivalent to: return lhs.native() == rhs.native();.

friend strong_ordering operator<=>(const argument& lhs, const argument& rhs) noexcept;

Effects: Equivalent to: return lhs.native() <=> rhs.native();.

Inserter [arguments.argument.ins]

template<class charT, class traits>
  friend basic_ostream<charT, traits>&
    operator<<(basic_ostream<charT, traits>& os, const argument& a);

Effects: Equivalent to: return os << a.string<charT, traits>();.

Formatter [arguments.argument.fmt]

template<class FormatContext>
  typename FormatContext::iterator
    format(const argument& argument, FormatContext& ctx) const;

Effects: Equivalent to: return std::formatter<argument::string_view_type>::format(argument.string<charT, char_traits<charT>>(), ctx);.

References

Normative References

[N4950]
Thomas Köppe. Working Draft, Standard for Programming Language C++. 10 May 2023. URL: https://wg21.link/n4950

Informative References

[N2948]
Accessing the command line arguments outside of main(). URL: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2948.pdf
[P0781]
A Modern C++ Signature for main. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0781r0.html
[P1275]
Desert Sessions: Improving hostile environment interactions. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1275r0.html
[P2728]
Unicode in the Library, Part 1: UTF Transcoding. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2728r7.html
[What is the encoding of argv?]
What is the encoding of argv?. URL: https://stackoverflow.com/questions/5408730/what-is-the-encoding-of-argv