Universally Parsable PKGBUILDs
About
You can join in on the discussion about this here.
The current PKGBUILD format used in Arch Linux is a bash file that is sourced by makepkg, which is itself written in bash. This setup works well for makepkg and any other tools written in bash, but it makes PKGBUILDs difficult to parse in anything else. There are currently two alternatives for extracting PKGBUILD data when not coding in bash:
- source the PKGBUILD file with a bash script and have it format the output for your code
- implement a bash parser in your code
Both of these approaches have their limitations. Sourcing a bash file can expose the system to malicious code and it also incurs extra overhead. If you wish to build a package, it makes sense to inspect the PKGBUILD for nasty code, especially in the build function and the install script, but if you are parsing metadata such as dependency trees from hundreds or thousands of PKGBUILDs, then it is unreasonable to inspect every one of them by hand first.
Implementing a minimal bash parser is not a good solution either. While it would work for PKGBUILDs that only use plain variable assignments outside of the build function, it would still fail on those which use anything else, such as the output of various commands. Not only would it be difficult and redundant to write a bash parser in each and every language one might use to parse a PKGBUILD, it would also expose your system to the same problems that you currently get from sourcing PKGBUILDs (malicious code if your parser has to run commands, overhead for more complicated parsing).
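To make the limitation concrete, here is a rough sketch of such a minimal parser in Python. The function name naive_extract and the sample PKGBUILD text are made up for illustration. It only recognizes one-line assignments and never runs any bash, so a command substitution comes back as raw text rather than as the value makepkg would see.

```python
import re

# Matches one-line assignments such as `pkgname=powerpill` or
# `pkgdesc="some text"`. Anything more complex is captured as raw,
# unevaluated text.
ASSIGNMENT = re.compile(r"^(\w+)=(?:\"([^\"]*)\"|'([^']*)'|(.*))$")

def naive_extract(pkgbuild_text):
    """Collect simple assignments without running any bash."""
    found = {}
    for line in pkgbuild_text.splitlines():
        match = ASSIGNMENT.match(line.strip())
        if match:
            name, dq, sq, bare = match.groups()
            # Exactly one of the three value groups is set.
            found[name] = next(v for v in (dq, sq, bare) if v is not None)
    return found

# Hypothetical PKGBUILD fragment: the second line needs a shell to evaluate.
sample = """\
pkgname=powerpill
pkgver=$(date +%Y%m%d)
"""
result = naive_extract(sample)
print(result["pkgname"])  # a plain value is recovered
print(result["pkgver"])   # the command substitution is returned unevaluated
```

Getting the real value of pkgver here would mean executing the command, which brings back the security and overhead problems described above.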
Because of these considerations and the inherently static nature of the metadata in a PKGBUILD, I've been playing around with the idea of an alternative PKGBUILD format.
Design Goals
Simplicity
Following Arch's overall KISS philosophy, the format needs to be simple. The main goal is to make something which is very easy to parse programmatically so that writing parsers will be nearly trivial. It also needs to remain simple for human readability to continue the tradition of easy package creation.
Extensibility
The format should make it possible to add data that might be required later without breaking backwards compatibility. While this cannot be guaranteed, careful consideration now may prevent limitations and other issues in the future.
The Current Idea
This section will detail the current working example. This is still at the brainstorming stage and may never be more than an idea. If you see a problem or something that could be improved, join the discussion in the thread above.
Example of a possible PKGBUILD for powerpill. You can compare this syntax to the bash version here.
package {
  name powerpill,
  version 16.0,
  release 12,
  description "A wrapper for pacman that speeds up package retrieval " +
              "by using aria2c for concurrent/segmented downloads.",
  arch any,
  license GPL,
  backup "etc/powerpill.conf",
  depends { "aria2", "perl", "perl-xyne-arch", "perl-xyne-common" },
  url "http://xyne.archlinux.ca/info/powerpill",
  sources {
    "http://xyne.archlinux.ca/src/powerpill-16.0.tar.gz"
      md5sum f3b443b6238029474ad9eb07ddc13ae0
  },
  build {<[[
    install -D -m755 $srcdir/$pkgname/$pkgname $pkgdir/usr/bin/$pkgname
    install -D -m644 $srcdir/$pkgname/man/$pkgname.1.gz $pkgdir/usr/share/man/man1/$pkgname.1.gz
    install -D -m644 $srcdir/$pkgname/$pkgname.conf $pkgdir/etc/$pkgname.conf
  ]]>}
}
- partially inspired by JSON and YAML
- follows a key-value pair format
- inside double-quoted strings, the following escape sequences are recognized (see below for "\$$"): \" \\ \n \t \$$
- single-quoted strings are not used
- "{<[[" and "]]>}" can be used to quote sections without interpolation (inspired by XML's CDATA format)
- leading and trailing whitespace is ignored
- linebreaks are ignored
- "+" can be used to concatenate strings across lines
- data can be arbitrarily nested for extensibility (see below)
Key-Value Pairs
Examples:
name powerpill
Here "name" is the key and "powerpill" is the value.
depends { "aria2", "perl", "perl-xyne-arch", "perl-xyne-common" }
Here "depends" is the key and everything after it is the value.
Points about keys and values:
- keys may be single words or quoted strings
- values may be single words, quoted strings or blocks wrapped in curly braces
- a space separates a key from a value (maybe use a colon instead)
- key-value pairs are separated by commas
- single-word keys and values may not contain commas (this may be extended)
- keys always take a single value, but that value may be a block with further key-value pairs
- a value itself may be a key*
* Consider the following section for the sources:
sources {
  "http://xyne.archlinux.ca/src/powerpill-16.0.tar.gz"
    md5sum f3b443b6238029474ad9eb07ddc13ae0
},
Notice how there is no comma following the URL. This is not accidental. Here the URL behaves as a key and takes the key-value pair "md5sum f3b443b6238029474ad9eb07ddc13ae0" as its value. The reason for including "md5sum" is to enable the format to be extended for other checksums, such as sha1. Here's an example of how that might look:
sources {
  "http://xyne.archlinux.ca/src/powerpill-16.0.tar.gz" {
    md5sum f3b443b6238029474ad9eb07ddc13ae0,
    sha1sum 4ceda6ec486aed5489ac848f2ca5cc44a413d5e9
  }
},
Notice how we've added the curly braces after the URL. This is now required because the value is no longer a single key-value pair but rather a list of two key-value pairs, one for "md5sum" and one for "sha1sum".
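The key-value rules above are simple enough that a parser fits in a few dozen lines. The following Python sketch is only one possible reading of the draft, not a reference implementation. The names tokenize, parse_term, parse_entry and parse_block are invented here, escape decoding and "{<[[" block quotes are left out, and the nested tuple/list result is just one convenient representation.

```python
import re

# One alternative per token kind: a double-quoted string, a punctuation
# character, or a bare word. Escape decoding is left out of this sketch.
TOKEN = re.compile(r'\s*(?:"((?:[^"\\]|\\.)*)"|([{},+])|([^\s"{},]+))')
OPEN, CLOSE, COMMA, PLUS = (("punct", c) for c in "{},+")

def tokenize(text):
    """Split draft-format text into ("term", ...) and ("punct", ...) tuples."""
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        string, punct, word = match.groups()
        if punct is not None:
            tokens.append(("punct", punct))
        else:
            tokens.append(("term", word if string is None else string))
        pos = match.end()
    return tokens

def parse_term(tokens):
    """Read one word or string, joining "a" + "b" concatenations."""
    kind, value = tokens.pop(0)
    if kind != "term":
        raise ValueError(f"expected a word or string, got {value!r}")
    if tokens and tokens[0] == PLUS:
        tokens.pop(0)
        value += parse_term(tokens)
    return value

def parse_entry(tokens):
    """entry := term [ block | entry ]"""
    key = parse_term(tokens)
    if not tokens or tokens[0] in (COMMA, CLOSE):
        return key                        # a bare value such as "aria2"
    if tokens[0] == OPEN:
        return (key, parse_block(tokens))
    return (key, parse_entry(tokens))     # a value may itself be a key

def parse_block(tokens):
    """block := "{" entry { "," entry } "}" """
    tokens.pop(0)                         # consume the opening brace
    entries = []
    while tokens and tokens[0] != CLOSE:
        entries.append(parse_entry(tokens))
        if tokens and tokens[0] == COMMA:
            tokens.pop(0)
    if not tokens:
        raise ValueError("missing closing brace")
    tokens.pop(0)                         # consume the closing brace
    return entries

sources = parse_entry(tokenize('''
sources {
  "http://xyne.archlinux.ca/src/powerpill-16.0.tar.gz"
    md5sum f3b443b6238029474ad9eb07ddc13ae0
}'''))
print(sources)
```

In this representation the URL ends up as a key whose value is the pair ("md5sum", "f3b4..."), which mirrors the "a value itself may be a key" rule. Switching to the braced form with both md5sum and sha1sum would simply turn that value into a list of two pairs.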
Convenience Variables
With the current bash PKGBUILDs, you can do something like this to save some typing when you update the PKGBUILD:
pkgname=powerpill
pkgver=16.0
source=(http://xyne.archlinux.ca/src/${pkgname}-${pkgver}.tar.gz)
To achieve this with the example format given above, it would be possible to use a variable delimiter inside of strings to signify that something should be replaced. Consider this example:
package {
  name powerpill,
  version 16.0,
  url "http://xyne.archlinux.ca/info/$$name$$",
  sources {
    "http://xyne.archlinux.ca/src/$$name$$-$$version$$.tar.gz"
      md5sum f3b443b6238029474ad9eb07ddc13ae0
  },
  ...
Here, "$$name$$" and "$$version$$" would be replaced with "powerpill" and "16.0", respectively. Additional variables could be specified as follows:
package {
  variables {
    svn "svn://example.com/this/that/whatever",
    login anonymarch,
    password "foo bar"
  },
  ...
Variables delimited with "$$" would be replaced in both double-quoted strings and blocks quoted with the custom "{<[[" and "]]>}" tags, including the build function. Literal "$$" would require an escape in both ("\$$").
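The replacement rule can be sketched with a single regular expression in Python. The function name substitute and the choice to raise an error for unknown variable names are assumptions of this sketch. The draft does not say what should happen in that case.

```python
import re

# Either an escaped delimiter (\$$) or a $$name$$ reference.
PATTERN = re.compile(r"\\\$\$|\$\$(\w+)\$\$")

def substitute(text, variables):
    """Replace $$name$$ references and turn \\$$ into a literal $$."""
    def replace(match):
        name = match.group(1)
        if name is None:
            return "$$"           # the escaped form becomes a literal "$$"
        return variables[name]    # assumption: unknown names raise KeyError
    return PATTERN.sub(replace, text)

variables = {"name": "powerpill", "version": "16.0"}
url = substitute(
    "http://xyne.archlinux.ca/src/$$name$$-$$version$$.tar.gz", variables
)
print(url)
```

Ordinary shell variables with a single "$", such as $srcdir in the build function, do not match the pattern and pass through untouched.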
Block Quotes
As mentioned above, "{<[[" and "]]>}" can be used to delimit block quotes. The quote begins when "{<[[" is found after a key and it ends when "]]>}" is found on a line by itself. Whitespace around the closing tag may be ignored (at this point anyway). Variables delimited with "$$" will still be replaced in block quotes, but nothing else will be interpreted. "$$" can be escaped in a block quote with "\$$".
Architecture-Specific Data
Some PKGBUILDs use if-then-else statements to handle different architectures. Here's how this would be handled in the current example:
package {
  name foo,
  version 1.3,
  depends { depa, depb, depc },
  arch {
    i686 {
      depends { depd, depe }
    },
    x86_64 {
      depends { lib32-depd, lib32-depe }
    }
  },
  ...
The idea is that depends and any other value specified in the architecture section will be appended to the corresponding section for that package. In this example, if foo is built for i686 then it will depend on "depa", "depb", "depc", "depd" and "depe" but if it is built for x86_64 then it will depend on "depa", "depb", "depc", "lib32-depd" and "lib32-depe".
The architecture section could handle separate source URLs, versions, custom variables for the build function, etc.
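Once the file has been parsed, resolving the data for one architecture is a small merge step. The Python sketch below assumes the package has already been turned into a plain dictionary, and the function name resolve_for_arch is invented for this example. Lists are appended as described above. Letting single values such as a version replace the shared one is my own assumption, since the draft does not spell that out.

```python
def resolve_for_arch(package, arch):
    """Merge the entries under package["arch"][arch] into the shared data."""
    # Copy the shared data, leaving the "arch" section itself out.
    resolved = {
        key: list(value) if isinstance(value, list) else value
        for key, value in package.items()
        if key != "arch"
    }
    per_arch = package.get("arch")
    # "arch" may also be a plain value such as "any", with nothing to merge.
    extras = per_arch.get(arch, {}) if isinstance(per_arch, dict) else {}
    for key, extra in extras.items():
        if isinstance(extra, list):
            resolved[key] = resolved.get(key, []) + extra   # lists are appended
        else:
            resolved[key] = extra   # assumption: single values override
    return resolved

foo = {
    "name": "foo",
    "version": "1.3",
    "depends": ["depa", "depb", "depc"],
    "arch": {
        "i686": {"depends": ["depd", "depe"]},
        "x86_64": {"depends": ["lib32-depd", "lib32-depe"]},
    },
}
print(resolve_for_arch(foo, "i686")["depends"])
print(resolve_for_arch(foo, "x86_64")["depends"])
```

Because the merge works on plain data, a dependency resolver could compute the result for every architecture without building or sourcing anything.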
Further Ideas
- maybe force quotes for all non-alphanumeric keys and values, e.g. "16.0"
- other stuff which I can't remember right now
Benefits Of Using This Format
The main benefit is the ability to securely and accurately parse a package's metadata. This would be very useful for handling submitted packages from unknown or untrusted sources such as the AUR. Taking the AUR as an example, this format would enable the server to reliably parse uploaded PKGBUILDs for all relevant data. It would then be possible to access this data remotely via the html and json-rpc interfaces. This could be used to improve tools such as yaourt which attempt to resolve dependencies for AUR packages.
I can already think of a few tools that would be greatly improved by this, such as repo creation tools, source downloaders, dependency resolvers, etc.
The other main benefit is the liberation from bash. Bash is a great tool for basic scripting but it is limited in comparison to other scripting languages. Being able to write tools in one's language of choice would be a very good thing. This doesn't mean that we would replace all tools written in bash (bash is good when considering dependencies), only that we would have versatile alternatives (not limited to scripting languages either). There is no intrinsic connection between package metadata and bash and there should not be a forced link between them. It's the programming-language equivalent of vendor lock-in.
Other benefits include the introduction of uniformity to PKGBUILDs, easy extensibility and the possibility of collaboration with other package-tool developers. If PKGBUILDs become a true encapsulation of package metadata instead of just a makepkg plugin, there is a greater chance that they may be used as a packaging lingua franca. Although it's probably unlikely, the idea of an open package metadata standard is a nice one. It would be advantageous to developers, packagers and users alike.
Other Formats Which Have Been Considered
XML was briefly considered because it is very good at encapsulating data in a way that's easy to parse programmatically and there are numerous tools and libraries already available to handle XML. XML was rejected though because many people find XML difficult to edit manually. It also adds a lot of characters to the final file.
My inclusion of an XML example in the original post on the forum nearly derailed the entire discussion... many bricks were shat.
JSON was also considered but due to perceived limitations (e.g. the lack of verbatim quoting such as XML's CDATA), it was also rejected.
A custom format that relied on whitespace similar to Python and Haskell was also proposed along with an example, but it was pointed out that whitespace requires special quoting in forum posts and emails, which can lead to corrupt PKGBUILDs.
Todo
- keep the discussion going on the forum and continue to improve the idea
- determine if it would be worthwhile to pursue further
- start writing some converters to switch between formats to test the viability of the idea
- write a parser in bash to test with makepkg
- reconsider some variable names for the sake of clarity and generality
- consider a new name for the file (to distinguish it from a bash PKGBUILD... perhaps PKGMETA)
Progress
I've written a trial version of a script to convert the current bash format to the current draft of the new format: pkgbuild2pkgmeta. There are still several things to do with it but the basic output should be sound. Check the comments at the top of the file for more information.