[Kernel documentation translation] Overlay Filesystem

Overview

The original text of this article comes from the kernel source documentation (source link).

Official website: https://docs.kernel.org

Overlay Filesystem

This document describes a prototype for a new approach to providing overlay-filesystem functionality in Linux (sometimes referred to as union-filesystems). An overlay-filesystem tries to present a filesystem which is the result of overlaying one filesystem on top of the other.

Overlay objects

The overlay filesystem approach is 'hybrid', because the objects that appear in the filesystem do not always appear to belong to that filesystem. In many cases, an object accessed in the union will be indistinguishable from accessing the corresponding object from the original filesystem. This is most obvious from the 'st_dev' field returned by stat(2).

While directories will report an st_dev from the overlay-filesystem, non-directory objects may report an st_dev from the lower filesystem or upper filesystem that is providing the object. Similarly st_ino will only be unique when combined with st_dev, and both of these can change over the lifetime of a non-directory object. Many applications and tools ignore these values and will not be affected.

In the special case of all overlay layers on the same underlying filesystem, all objects will report an st_dev from the overlay filesystem and st_ino from the underlying filesystem. This will make the overlay mount more compliant with filesystem scanners and overlay objects will be distinguishable from the corresponding objects in the original filesystem.

On 64bit systems, even if all overlay layers are not on the same underlying filesystem, the same compliant behavior could be achieved with the "xino" feature. The "xino" feature composes a unique object identifier from the real object st_ino and an underlying fsid index. The "xino" feature uses the high inode number bits for fsid, because the underlying filesystems rarely use the high inode number bits. In case the underlying inode number does overflow into the high xino bits, overlay filesystem will fall back to the non xino behavior for that inode.

The "xino" feature can be enabled with the "-o xino=on" overlay mount option. If all underlying filesystems support NFS file handles, the value of st_ino for overlay filesystem objects is not only unique, but also persistent over the lifetime of the filesystem. The "-o xino=auto" overlay mount option enables the "xino" feature only if the persistent st_ino requirement is met.
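
The composition described above can be pictured with a little integer arithmetic. The following is a minimal sketch of the idea only, not the kernel's code: the function name `xino_compose` and the choice of 8 reserved bits in the example call are assumptions made for illustration.

```shell
# Hypothetical sketch of the idea only; this is not the kernel's code.
# xino_compose FSID INO BITS
#   Reserve the top BITS bits of a 64-bit number for FSID and keep INO below.
xino_compose() {
    shift_by=$((64 - $3))
    # The real inode number already uses the reserved high bits:
    # report overflow, mirroring the fallback to non-xino behavior.
    if [ $(( $2 >> shift_by )) -ne 0 ]; then
        echo "overflow"
        return 1
    fi
    printf '0x%016x\n' $(( ($1 << shift_by) | $2 ))
}

# fsid 1, inode 42, 8 high bits reserved for the fsid
xino_compose 1 42 8
```

An inode number that already reaches into the reserved bits makes the function report "overflow", which corresponds to the per-inode fallback mentioned in the text.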

The following table summarizes what can be expected in different overlay configurations.

Inode properties

Configuration                     Persistent st_ino   Uniform st_dev   st_ino == d_ino   d_ino == i_ino [*]
                                  dir     !dir        dir     !dir     dir     !dir      dir     !dir
All layers on same fs             Y       Y           Y       Y        Y       Y         Y       Y
Layers not on same fs, xino=off   N       N           Y       N        N       Y         N       Y
xino=on/auto                      Y       Y           Y       Y        Y       Y         Y       Y
xino=on/auto, ino overflow        N       N           Y       N        N       Y         N       Y

[*] nfsd v3 readdirplus verifies d_ino == i_ino. i_ino is exposed via several /proc files, such as /proc/locks and /proc/self/fdinfo/<fd> of an inotify file descriptor.

Upper and Lower

An overlay filesystem combines two filesystems - an 'upper' filesystem and a 'lower' filesystem. When a name exists in both filesystems, the object in the 'upper' filesystem is visible while the object in the 'lower' filesystem is either hidden or, in the case of directories, merged with the 'upper' object.

It would be more correct to refer to an upper and lower 'directory tree' rather than 'filesystem' as it is quite possible for both directory trees to be in the same filesystem and there is no requirement that the root of a filesystem be given for either upper or lower.

(Translator's note: the upper and lower layers may each be any directory within a filesystem; the filesystem's / directory does not have to be given to a layer.)

A wide range of filesystems supported by Linux can be the lower filesystem, but not all filesystems that are mountable by Linux have the features needed for OverlayFS to work. The lower filesystem does not need to be writable. The lower filesystem can even be another overlayfs. The upper filesystem will normally be writable and if it is it must support the creation of trusted.* and/or user.* extended attributes, and must provide valid d_type in readdir responses, so NFS is not suitable.

A read-only overlay of two read-only filesystems may use any filesystem type.

Directories

Overlaying mainly involves directories. If a given name appears in both upper and lower filesystems and refers to a non-directory in either, then the lower object is hidden - the name refers only to the upper object.

Where both upper and lower objects are directories, a merged directory is formed.

At mount time, the two directories given as mount options "lowerdir" and "upperdir" are combined into a merged directory:

mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,workdir=/work /merged

The "workdir" needs to be an empty directory on the same filesystem as upperdir.

Then whenever a lookup is requested in such a merged directory, the lookup is performed in each actual directory and the combined result is cached in the dentry belonging to the overlay filesystem. If both actual lookups find directories, both are stored and a merged directory is created, otherwise only one is stored: the upper if it exists, else the lower.

Only the lists of names from directories are merged. Other content such as metadata and extended attributes are reported for the upper directory only. These attributes of the lower directory are hidden.
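
The lookup rule above can be imitated in userspace to see which layer a name would come from. This is a minimal sketch under stated assumptions: `ovl_lookup` is a hypothetical helper, it looks only at plain directories on disk, and it ignores whiteouts, opaque directories and redirects.

```shell
# ovl_lookup UPPER LOWER NAME
#   Print which layer(s) a name would resolve to, using the rule from the text.
#   Whiteouts and opaque directories are deliberately ignored here.
ovl_lookup() {
    up="$1/$3"
    low="$2/$3"
    if [ -d "$up" ] && [ ! -L "$up" ] && [ -d "$low" ] && [ ! -L "$low" ]; then
        echo "merged"          # both are directories: a merged directory
    elif [ -e "$up" ] || [ -L "$up" ]; then
        echo "upper"           # upper exists: it hides the lower object
    elif [ -e "$low" ] || [ -L "$low" ]; then
        echo "lower"           # only the lower object exists
    else
        echo "none"
    fi
}
```

For example, `ovl_lookup /upper /lower etc` prints "merged" when both trees contain a directory named etc.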

whiteouts and opaque directories

(Translator's note: I could not find a widely accepted Chinese translation of "whiteout". Taken word by word it can mean "whitening", "temporary blindness" or "correction fluid"; I think "correction" would be a reasonable rendering. To stay accurate, the term whiteout is used untranslated below.)

In order to support rm and rmdir without changing the lower filesystem, an overlay filesystem needs to record in the upper filesystem that files have been removed. This is done using whiteouts and opaque directories (non-directories are always opaque).

A whiteout is created as a character device with 0/0 device number. When a whiteout is found in the upper level of a merged directory, any matching name in the lower level is ignored, and the whiteout itself is also hidden.

A directory is made opaque by setting the xattr "trusted.overlay.opaque" to "y". Where the upper filesystem contains an opaque directory, any directory in the lower filesystem with the same name is ignored.
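
Both markers can be inspected directly in an upper directory while the overlay is not mounted. The helpers below are a hedged sketch: the names `is_whiteout` and `is_opaque` are made up for illustration, `stat -c` assumes GNU coreutils, and `getfattr` assumes the attr package is installed.

```shell
# Assumes GNU coreutils stat(1); getfattr(1) comes from the attr package.
# is_whiteout PATH: true if PATH is a character device with device number 0/0.
is_whiteout() {
    [ -c "$1" ] && [ "$(stat -c '%t:%T' "$1")" = "0:0" ]
}

# is_opaque DIR: true if the directory carries trusted.overlay.opaque=y.
# Reading trusted.* xattrs normally requires root.
is_opaque() {
    [ "$(getfattr --only-values -n trusted.overlay.opaque "$1" 2>/dev/null)" = "y" ]
}
```

A typical use would be `is_whiteout /upper/some/name && echo "hidden in the merged view"`, where the path is a placeholder.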

readdir

When a 'readdir' request is made on a merged directory, the upper and lower directories are each read and the name lists merged in the obvious way (upper is read first, then lower - entries that already exist are not re-added). This merged name list is cached in the 'struct file' and so remains as long as the file is kept open. If the directory is opened and read by two processes at the same time, they will each have separate caches. A seekdir to the start of the directory (offset 0) followed by a readdir will cause the cache to be discarded and rebuilt.

This means that changes to the merged directory do not appear while a directory is being read. This is unlikely to be noticed by many programs.

seek offsets are assigned sequentially when the directories are read. Thus if

  • read part of a directory
  • remember an offset, and close the directory
  • re-open the directory some time later
  • seek to the remembered offset

there may be little correlation between the old and new locations in the list of filenames, particularly if anything has changed in the directory.

Readdir on directories that are not merged is simply handled by the underlying directory (upper or lower).
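
The name-list merge for a merged directory (upper first, then lower, no duplicates) can be approximated from userspace. This is only a sketch: `merged_names` is a hypothetical helper, it does not handle whiteouts or opaque directories, and it assumes file names without embedded newlines.

```shell
# merged_names UPPER LOWER
#   List UPPER first, then LOWER, skipping names that were already seen.
#   Whiteouts and opaque directories are ignored in this sketch.
merged_names() {
    { ls -A "$1"; ls -A "$2"; } | awk '!seen[$0]++'
}
```

The awk filter keeps the first occurrence of each name, so an entry that exists in both directories is listed once, in the position it had in the upper directory.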

Renaming directories

When renaming a directory that is on the lower layer or merged (i.e. the directory was not created on the upper layer to start with) overlayfs can handle it in two different ways:

  1. return EXDEV error: this error is returned by rename(2) when trying to move a file or directory across filesystem boundaries. Hence applications are usually prepared to handle this error (mv(1) for example recursively copies the directory tree). This is the default behavior.

  2. If the "redirect_dir" feature is enabled, then the directory will be copied up (but not the contents). Then the "trusted.overlay.redirect" extended attribute is set to the path of the original location from the root of the overlay. Finally the directory is moved to the new location.

There are several ways to tune the "redirect_dir" feature.

Kernel config options:

  • OVERLAY_FS_REDIRECT_DIR: If this is enabled, then redirect_dir is turned on by default.

  • OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW: If this is enabled, then redirects are always followed by default. Enabling this results in a less secure configuration. Enable this option only when worried about backward compatibility with kernels that have the redirect_dir feature and follow redirects even if turned off.

(Translator's note: "following redirects" presumably means following the redirect extended attribute to locate the original directory or file.)

Module options (can also be changed through /sys/module/overlay/parameters/):

  • "redirect_dir=BOOL": See OVERLAY_FS_REDIRECT_DIR kernel config option above.

  • "redirect_always_follow=BOOL": See OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW kernel config option above.

  • "redirect_max=NUM": The maximum number of bytes in an absolute redirect (default is 256).

Mount options:

  • "redirect_dir=on": Redirects are enabled.

  • "redirect_dir=follow": Redirects are not created, but followed.

  • "redirect_dir=nofollow": Redirects are not created and not followed.

  • "redirect_dir=off": If "redirect_always_follow" is enabled in the kernel/module config, this "off" translates to "follow", otherwise it translates to "nofollow".
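
The four mount option values reduce to two questions: are redirects created, and are they followed. The sketch below spells that mapping out; `effective_redirect` is a hypothetical helper, and its second argument (Y or N) merely stands in for the "redirect_always_follow" kernel/module setting.

```shell
# effective_redirect MODE ALWAYS_FOLLOW
#   MODE is the redirect_dir= mount option value; ALWAYS_FOLLOW is Y or N,
#   standing in for the redirect_always_follow kernel/module setting.
effective_redirect() {
    case "$1" in
        on)       echo "create=yes follow=yes" ;;
        follow)   echo "create=no follow=yes" ;;
        nofollow) echo "create=no follow=no" ;;
        off)
            if [ "$2" = "Y" ]; then
                echo "create=no follow=yes"   # "off" translates to "follow"
            else
                echo "create=no follow=no"    # "off" translates to "nofollow"
            fi ;;
        *) echo "unknown redirect_dir value: $1" >&2; return 1 ;;
    esac
}
```

On a running system the module-level defaults can be read from the files under /sys/module/overlay/parameters/.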

When the NFS export feature is enabled, every copied up directory is indexed by the file handle of the lower inode and a file handle of the upper directory is stored in a "trusted.overlay.upper" extended attribute on the index entry. On lookup of a merged directory, if the upper directory does not match the file handle stored in the index, that is an indication that multiple upper directories may be redirected to the same lower directory. In that case, lookup returns an error and warns about a possible inconsistency.

Because lower layer redirects cannot be verified with the index, enabling NFS export support on an overlay filesystem with no upper layer requires turning off redirect follow (e.g. "redirect_dir=nofollow").

Non-directories

Objects that are not directories (files, symlinks, device-special files etc.) are presented either from the upper or lower filesystem as appropriate. When a file in the lower filesystem is accessed in a way that requires write-access, such as opening for write access, changing some metadata etc., the file is first copied from the lower filesystem to the upper filesystem (copy_up). Note that creating a hard-link also requires copy_up, though of course creation of a symlink does not.

The copy_up may turn out to be unnecessary, for example if the file is opened for read-write but the data is not modified.

The copy_up process first makes sure that the containing directory exists in the upper filesystem - creating it and any parents as necessary. It then creates the object with the same metadata (owner, mode, mtime, symlink-target etc.) and then if the object is a file, the data is copied from the lower to the upper filesystem. Finally any extended attributes are copied up.

Once the copy_up is complete, the overlay filesystem simply provides direct access to the newly created file in the upper filesystem - future operations on the file are barely noticed by the overlay filesystem (though an operation on the name of the file such as rename or unlink will of course be noticed and handled).
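
The copy_up steps for a non-directory object can be mimicked with ordinary tools. The function below is a rough userspace imitation, not what the kernel does internally: `copy_up` here is a hypothetical shell helper working on two plain directory trees, and `cp -a` is assumed to be the GNU coreutils version.

```shell
# copy_up LOWER UPPER RELPATH
#   Userspace imitation of the copy_up steps for a non-directory object.
copy_up() {
    # Already present in upper: nothing to do, the upper object is used as-is.
    if [ -e "$2/$3" ] || [ -L "$2/$3" ]; then
        return 0
    fi
    parent=$(dirname "$3")
    # 1. Make sure the containing directory (and its parents) exist in upper.
    #    mkdir -p does not carry over the lower directories' metadata.
    mkdir -p "$2/$parent"
    # 2. Create the object with the same metadata, then copy data and xattrs.
    cp -a "$1/$3" "$2/$3"
}
```

After a call such as `copy_up /lower /upper etc/hosts`, later changes go to the copy in /upper and the file in /lower stays untouched, which is the on-demand counterpart of copying the whole tree up front.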

Permission model

Permission checking in the overlay filesystem follows these principles:

  1. permission check SHOULD return the same result before and after copy up

  2. task creating the overlay mount MUST NOT gain additional privileges

  3. non-mounting task MAY gain additional privileges through the overlay, compared to direct access on underlying lower or upper filesystems

This is achieved by performing two permission checks on each access

a. check if current task is allowed access based on local DAC (owner, group, mode and posix acl), as well as MAC checks

b. check if mounting task would be allowed real operation on lower or upper layer based on underlying filesystem permissions, again including MAC checks

Check (a) ensures consistency (1) since owner, group, mode and posix acls are copied up. On the other hand it can result in server enforced permissions (used by NFS, for example) being ignored (3).

Check (b) ensures that no task gains permissions to underlying layers that the mounting task does not have (2). This also means that it is possible to create setups where the consistency rule (1) does not hold; normally, however, the mounting task will have sufficient privileges to perform all operations.

Another way to demonstrate this model is drawing parallels between

mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged

and

cp -a /lower /upper && mount --bind /upper /merged

(Translator's note: the original reads "cp -a /lower /upper mount --bind /upper /merged"; I believe the && was dropped.)

The resulting access permissions should be the same. The difference is in the time of copy (on-demand vs. up-front).

Multiple lower layers

Multiple lower layers can now be given using the colon (":") as a separator character between the directory names. For example:

mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged

As the example shows, "upperdir=" and "workdir=" may be omitted. In that case the overlay will be read-only.

The specified lower directories will be stacked beginning from the rightmost one and going left. In the above example lower1 will be the top, lower2 the middle and lower3 the bottom layer.

Metadata only copy up

When the metadata only copy up feature is enabled, overlayfs will copy up only the metadata (as opposed to the whole file) when a metadata-specific operation like chown/chmod is performed. The full file will be copied up later, when the file is opened for a WRITE operation.

In other words, this is delayed data copy up operation and data is copied up when there is a need to actually modify data.

There are multiple ways to enable/disable this feature. A config option CONFIG_OVERLAY_FS_METACOPY can be set/unset to enable/disable this feature by default. Or one can enable/disable it at module load time with module parameter metacopy=on/off. Lastly, there is also a per mount option metacopy=on/off to enable/disable this feature per mount.

Do not use metacopy=on with untrusted upper/lower directories. Otherwise it is possible that an attacker can create a handcrafted file with appropriate REDIRECT and METACOPY xattrs, and gain access to file on lower pointed by REDIRECT. This should not be possible on local system as setting "trusted." xattrs will require CAP_SYS_ADMIN. But it should be possible for untrusted layers like from a pen drive.

Note: redirect_dir={off|nofollow|follow[*]} and nfs_export=on mount options conflict with metacopy=on, and will result in an error.

[*] redirect_dir=follow only conflicts with metacopy=on if upperdir=... is given.

Data-only lower layers

With "metacopy" feature enabled, an overlayfs regular file may be a composition of information from up to three different layers:

  1. metadata from a file in the upper layer
  2. st_ino and st_dev object identifier from a file in a lower layer
  3. data from a file in another lower layer (further below)

The "lower data" file can be on any lower layer, except from the top most lower layer.

Below the top most lower layer, any number of lower most layers may be defined as "data-only" lower layers, using double colon ("::") separators. A normal lower layer is not allowed to be below a data-only layer, so single colon separators are not allowed to the right of double colon ("::") separators.

For example:

mount -t overlay overlay -olowerdir=/l1:/l2:/l3::/do1::/do2 /merged

The paths of files in the "data-only" lower layers are not visible in the merged overlayfs directories and the metadata and st_ino/st_dev of files in the "data-only" lower layers are not visible in overlayfs inodes.

Only the data of the files in the "data-only" lower layers may be visible when a "metacopy" file in one of the lower layers above it, has a "redirect" to the absolute path of the "lower data" file in the "data-only" lower layer.
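
Because a normal layer may never appear to the right of a data-only layer, the lowerdir value is easy to build mechanically. The helper below is an illustrative sketch: the name `lowerdir_opt` and the use of "--" to mark where the data-only layers start are assumptions of this example, and paths containing colons are not handled.

```shell
# lowerdir_opt LAYER... [-- DATA_ONLY_LAYER...]
#   Layers are given top-most first; anything after "--" is a data-only layer.
lowerdir_opt() {
    opt=""
    sep=":"
    for layer in "$@"; do
        if [ "$layer" = "--" ]; then
            sep="::"          # from here on, every layer is data-only
            continue
        fi
        if [ -z "$opt" ]; then
            opt="lowerdir=$layer"
        else
            opt="$opt$sep$layer"
        fi
    done
    echo "$opt"
}
```

The result can be passed to mount, for example `mount -t overlay overlay -o "$(lowerdir_opt /l1 /l2 /l3 -- /do1 /do2)" /merged`. Since the separator only ever switches from ":" to "::", the ordering rule stated above holds by construction.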

Sharing and copying layers

Lower layers may be shared among several overlay mounts and that is indeed a very common practice. An overlay mount may use the same lower layer path as another overlay mount and it may use a lower layer path that is beneath or above the path of another overlay lower layer path.

Using an upper layer path and/or a workdir path that are already used by another overlay mount is not allowed and may fail with EBUSY. Using partially overlapping paths is not allowed and may fail with EBUSY. If files are accessed from two overlayfs mounts which share or overlap the upper layer and/or workdir path the behavior of the overlay is undefined, though it will not result in a crash or deadlock.

Mounting an overlay using an upper layer path, where the upper layer path was previously used by another mounted overlay in combination with a different lower layer path, is allowed, unless the "inodes index" feature or "metadata only copy up" feature is enabled.

With the "inodes index" feature, on the first time mount, an NFS file handle of the lower layer root directory, along with the UUID of the lower filesystem, are encoded and stored in the "trusted.overlay.origin" extended attribute on the upper layer root directory. On subsequent mount attempts, the lower root directory file handle and lower filesystem UUID are compared to the stored origin in upper root directory. On failure to verify the lower root origin, mount will fail with ESTALE. An overlayfs mount with "inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem does not support NFS export, lower filesystem does not have a valid UUID or if the upper filesystem does not support extended attributes.

For "metadata only copy up" feature there is no verification mechanism at mount time. So if same upper is mounted with different set of lower, mount probably will succeed but expect the unexpected later on. So don't do it.

It is quite a common practice to copy overlay layers to a different directory tree on the same or different underlying filesystem, and even to a different machine. With the "inodes index" feature, trying to mount the copied layers will fail the verification of the lower root file handle.

Non-standard behavior

Current version of overlayfs can act as a mostly POSIX compliant filesystem.

This is the list of cases that overlayfs doesn't currently handle:

a) POSIX mandates updating st_atime for reads. This is currently not done in the case when the file resides on a lower layer.

b) If a file residing on a lower layer is opened for read-only and then memory mapped with MAP_SHARED, then subsequent changes to the file are not reflected in the memory mapping.

c) If a file residing on a lower layer is being executed, then opening that file for write or truncating the file will not be denied with ETXTBSY.

The following options allow overlayfs to act more like a standards compliant filesystem:

  1. "redirect_dir"

Enabled with the mount option or module option: "redirect_dir=on" or with the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.

If this feature is disabled, then rename(2) on a lower or merged directory will fail with EXDEV ("Invalid cross-device link").

  2. "inode index"

Enabled with the mount option or module option "index=on" or with the kernel config option CONFIG_OVERLAY_FS_INDEX=y.

If this feature is disabled and a file with multiple hard links is copied up, then this will "break" the link. Changes will not be propagated to other names referring to the same inode.

  3. "xino"

Enabled with the mount option "xino=auto" or "xino=on", with the module option "xino_auto=on" or with the kernel config option CONFIG_OVERLAY_FS_XINO_AUTO=y. Also implicitly enabled by using the same underlying filesystem for all layers making up the overlay.

If this feature is disabled or the underlying filesystem doesn't have enough free bits in the inode number, then overlayfs will not be able to guarantee that the values of st_ino and st_dev returned by stat(2) and the value of d_ino returned by readdir(3) will act like on a normal filesystem. E.g. the value of st_dev may be different for two objects in the same overlay filesystem and the value of st_ino for filesystem objects may not be persistent and could change even while the overlay filesystem is mounted, as summarized in the Inode properties table above.

Changes to underlying filesystems

Changes to the underlying filesystems while part of a mounted overlay filesystem are not allowed. If the underlying filesystem is changed, the behavior of the overlay is undefined, though it will not result in a crash or deadlock.

Offline changes, when the overlay is not mounted, are allowed to the upper tree. Offline changes to the lower tree are only allowed if the "metadata only copy up", "inode index", "xino" and "redirect_dir" features have not been used. If the lower tree is modified and any of these features has been used, the behavior of the overlay is undefined, though it will not result in a crash or deadlock.

When the overlay NFS export feature is enabled, the overlay filesystem's behavior on offline changes of the underlying lower layer differs from its behavior when NFS export is disabled.

On every copy_up, an NFS file handle of the lower inode, along with the UUID of the lower filesystem, are encoded and stored in an extended attribute "trusted.overlay.origin" on the upper inode.

When the NFS export feature is enabled, a lookup of a merged directory, that found a lower directory at the lookup path or at the path pointed to by the "trusted.overlay.redirect" extended attribute, will verify that the found lower directory file handle and lower filesystem UUID match the origin file handle that was stored at copy_up time. If a found lower directory does not match the stored origin, that directory will not be merged with the upper directory.

NFS export

When the underlying filesystems support NFS export and the "nfs_export" feature is enabled, an overlay filesystem may be exported to NFS.

With the "nfs_export" feature, on copy_up of any lower object, an index entry is created under the index directory. The index entry name is the hexadecimal representation of the copy up origin file handle. For a non-directory object, the index entry is a hard link to the upper inode. For a directory object, the index entry has an extended attribute "trusted.overlay.upper" with an encoded file handle of the upper directory inode.

When encoding a file handle from an overlay filesystem object, the following rules apply:

  1. For a non-upper object, encode a lower file handle from lower inode
  2. For an indexed object, encode a lower file handle from copy_up origin
  3. For a pure-upper object and for an existing non-indexed upper object, encode an upper file handle from upper inode
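
The three encoding rules amount to a small decision on two properties of the object. The sketch below only restates that decision; `handle_kind` and its yes/no arguments are hypothetical and do not correspond to any kernel interface.

```shell
# handle_kind HAS_UPPER INDEXED
#   HAS_UPPER: yes if the overlay object has an upper inode, else no.
#   INDEXED:   yes if an index entry exists for it, else no.
handle_kind() {
    if [ "$1" != "yes" ]; then
        echo "lower handle from lower inode"        # rule 1
    elif [ "$2" = "yes" ]; then
        echo "lower handle from copy_up origin"     # rule 2
    else
        echo "upper handle from upper inode"        # rule 3
    fi
}
```

Read this way, an object that started in a lower layer keeps a lower file handle even after an indexed copy_up, while pure-upper and non-indexed upper objects get an upper file handle.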

The encoded overlay file handle includes:

  • Header including path type information (e.g. lower/upper)
  • UUID of the underlying filesystem
  • Underlying filesystem encoding of underlying inode

This encoding format is identical to the encoding format of the file handles that are stored in the extended attribute "trusted.overlay.origin".

编码的overlay文件句柄包括:

  • 包含路径类型信息的标头(例如下层/上层)
  • 底层文件系统的UUID
  • 底层文件系统对底层 inode 的编码

此编码格式与存储在扩展属性“trusted.overlay.origin”中的文件句柄编码格式相同。
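To make the three components concrete, here is a Python sketch that packs and unpacks a handle with a header byte, a UUID and an opaque payload. The byte layout (`!B16s`), the constants and all function names are assumptions made for illustration; this is not the kernel's actual on-disk or on-wire format. The `index_entry_name` helper reflects the statement above that an index entry is named by the hexadecimal representation of the copy up origin file handle.

```python
import struct
import uuid

# Hypothetical path type values carried in the header.
PATH_TYPE_LOWER = 0
PATH_TYPE_UPPER = 1

# Assumed layout: 1 byte of path type followed by a 16-byte filesystem UUID.
_HEADER = struct.Struct("!B16s")


def encode_handle(path_type: int, fs_uuid: uuid.UUID, real_handle: bytes) -> bytes:
    """Prefix the underlying filesystem's own handle with header and UUID."""
    return _HEADER.pack(path_type, fs_uuid.bytes) + real_handle


def decode_handle(blob: bytes):
    """Split a handle back into (path_type, fs_uuid, real_handle)."""
    path_type, raw_uuid = _HEADER.unpack_from(blob)
    return path_type, uuid.UUID(bytes=raw_uuid), blob[_HEADER.size:]


def index_entry_name(origin_handle: bytes) -> str:
    """Hexadecimal name used for the index entry of a copy up origin."""
    return origin_handle.hex()
```

A round trip through `encode_handle` and `decode_handle` returns the original three components, which is the property the decoding steps rely on.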

When decoding an overlay file handle, the following steps are followed:

  1. Find underlying layer by UUID and path type information.
  2. Decode the underlying filesystem file handle to underlying dentry.
  3. For a lower file handle, lookup the handle in index directory by name.
  4. If a whiteout is found in index, return ESTALE. This represents an overlay object that was deleted after its file handle was encoded.
  5. For a non-directory, instantiate a disconnected overlay dentry from the decoded underlying dentry, the path type and index inode, if found.
  6. For a directory, use the connected underlying decoded dentry, path type and index, to lookup a connected overlay dentry.

解码overlay文件句柄时,遵循以下步骤:

  1. 通过UUID和路径类型信息找到底层的层。
  2. 将底层文件系统文件句柄解码为底层dentry(目录项)。
  3. 对于下层文件句柄,在索引目录中按名称查找该句柄。
  4. 如果在索引中发现whiteout,则返回 ESTALE。 这表示该overlay对象在其文件句柄编码之后已被删除。
  5. 对于非目录,从解码的底层目录项、路径类型和索引 inode(如果找到)实例化断开连接的overlay dentry。
  6. 对于目录,使用连接的底层解码目录项、路径类型和索引来查找连接的overlay dentry。
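The decoding steps above can be walked through with an in-memory model. This Python sketch is purely illustrative: the `Decoded` type, the dictionary shapes for `layers` and `index`, and the use of the raw handle's hex string as the index name are all simplifying assumptions, and returning ESTALE for an unresolved handle is an assumption that the text does not state.

```python
import errno
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

WHITEOUT = "whiteout"  # stand-in for a whiteout entry in the index directory


@dataclass
class Decoded:
    name: str
    is_dir: bool
    connected: bool
    index_entry: Optional[str] = None


# (fs_uuid, is_upper) -> {real_handle: (name, is_dir)}
Layers = Dict[Tuple[str, bool], Dict[bytes, Tuple[str, bool]]]


def decode(fs_uuid: str, is_upper: bool, real_handle: bytes,
           layers: Layers, index: Dict[str, str]) -> Decoded:
    # Step 1: find the underlying layer by UUID and path type.
    layer = layers.get((fs_uuid, is_upper))
    # Step 2: decode the underlying handle to an underlying object.
    obj = layer.get(real_handle) if layer is not None else None
    if obj is None:
        # Assumption: an unresolved handle is reported as stale.
        raise OSError(errno.ESTALE, "handle does not resolve")
    name, is_dir = obj
    # Step 3: only lower handles are looked up in the index, by name.
    entry = None if is_upper else index.get(real_handle.hex())
    # Step 4: a whiteout means the object was deleted after encoding.
    if entry == WHITEOUT:
        raise OSError(errno.ESTALE, "object deleted after encode")
    # Step 5: non-directories yield a disconnected dentry.
    # Step 6: directories yield a connected dentry.
    return Decoded(name, is_dir, connected=is_dir, index_entry=entry)
```

In this model, decoding the handle of a deleted file raises `OSError` with `errno.ESTALE`, while a regular file comes back with `connected=False` and carries its index entry when one exists.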

Decoding a non-directory file handle may return a disconnected dentry. copy_up of that disconnected dentry will create an upper index entry with no upper alias.

When overlay filesystem has multiple lower layers, a middle layer directory may have a "redirect" to a lower directory. Because middle layer "redirects" are not indexed, a lower file handle that was encoded from the "redirect" origin directory cannot be used to find the middle or upper layer directory. Similarly, a lower file handle that was encoded from a descendant of the "redirect" origin directory cannot be used to reconstruct a connected overlay path. To mitigate the cases of directories that cannot be decoded from a lower file handle, these directories are copied up on encode and encoded as an upper file handle. On an overlay filesystem with no upper layer this mitigation cannot be used. NFS export in this setup requires turning off redirect follow (e.g. "redirect_dir=nofollow").

解码非目录文件句柄可能会返回断开连接的目录项。 该断开连接的 dentry 的 copy_up 将创建一个没有上层别名的上层索引条目。

当overlay文件系统具有多个下层时,中间层目录可能带有指向下层目录的“重定向”。 因为中间层“重定向”没有索引,所以从“重定向”源目录编码的下层文件句柄不能用于查找中间层或上层目录。 类似地,从“重定向”源目录的后代编码的下层文件句柄不能用于重建连接的overlay路径。 为了缓解无法从下层文件句柄解码目录的情况,这些目录在编码时会被向上拷贝(copy_up),并编码为上层文件句柄。 在没有上层的overlay文件系统上,无法使用此缓解措施。此设置中的 NFS 导出需要关闭重定向跟随(例如“redirect_dir=nofollow”)。

The overlay filesystem does not support non-directory connectable file handles, so exporting with the 'subtree_check' exportfs configuration will cause failures to lookup files over NFS.

When the NFS export feature is enabled, all directory index entries are verified on mount time to check that upper file handles are not stale. This verification may cause significant overhead in some cases.

Note: the mount options index=off,nfs_export=on are conflicting for a read-write mount and will result in an error.

Note: the mount option uuid=off can be used to replace the UUID of the underlying filesystem in file handles with null, and effectively disable UUID checks. This can be useful in case the underlying disk is copied and the UUID of this copy is changed. This is only applicable if all lower/upper/work directories are on the same filesystem, otherwise it will fall back to normal behaviour.

覆盖文件系统不支持非目录可连接文件句柄,因此使用“subtree_check”exportfs 配置导出将导致通过 NFS 查找文件失败。

启用 NFS 导出功能后,所有目录索引条目都会在挂载时进行验证,以检查上层文件句柄是否已过时。 在某些情况下,此验证可能会导致显著的开销。

注意:挂载选项index=off,nfs_export=on对于读写挂载是冲突的,并且会导致错误。

注意:挂载选项 uuid=off 可用于将文件句柄中底层文件系统的 UUID 替换为 null,并有效禁用 UUID 检查。 如果复制了底层磁盘并且该副本的 UUID 发生了更改,这会很有用。 仅当所有下层/上层/工作目录位于同一文件系统上时才适用,否则它将回退到正常行为。
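The option conflict described in the first note can be caught before calling mount. The Python sketch below is a hypothetical pre-flight check, not part of any real tool: `parse_options` and `check_nfs_export_options` are illustrative names, and the caller is assumed to know whether the mount will be read-write.

```python
def parse_options(options: str) -> dict:
    """Split a comma-separated mount option string into a dict."""
    parsed = {}
    for item in options.split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def check_nfs_export_options(options: str, read_write: bool) -> None:
    """Raise ValueError for the combination the text says is an error.

    Only the read-write case is flagged, because that is the only case
    the note above describes as conflicting.
    """
    opts = parse_options(options)
    if read_write and opts.get("index") == "off" and opts.get("nfs_export") == "on":
        raise ValueError("index=off,nfs_export=on conflict on a read-write mount")
```

For example, `check_nfs_export_options("index=off,nfs_export=on", read_write=True)` raises `ValueError`, while the same string with `read_write=False` passes this particular check.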

易失性挂载(Volatile mount)

This is enabled with the "volatile" mount option. Volatile mounts are not guaranteed to survive a crash. It is strongly recommended that volatile mounts are only used if data written to the overlay can be recreated without significant effort.

The advantage of mounting with the "volatile" option is that all forms of sync calls to the upper filesystem are omitted.

这是通过“volatile”挂载选项启用的。无法保证易失性挂载能够在崩溃中幸存下来。 强烈建议仅当无需付出很大努力即可重新创建写入overlay fs的数据时,才使用易失性挂载。

使用“volatile”选项挂载的优点是省略了对上层文件系统的所有形式的sync调用。

In order to avoid giving a false sense of safety, the syncfs (and fsync) semantics of volatile mounts are slightly different than that of the rest of VFS. If any writeback error occurs on the upperdir's filesystem after a volatile mount takes place, all sync functions will return an error. Once this condition is reached, the filesystem will not recover, and every subsequent sync call will return an error, even if the upperdir has not experienced a new error since the last sync call.

When overlay is mounted with the "volatile" option, the directory "$workdir/work/incompat/volatile" is created. During the next mount, overlay checks for this directory and refuses to mount if it is present. This is a strong indicator that the user should throw away the upper and work directories and create fresh ones. In very limited cases where the user knows that the system has not crashed and the contents of upperdir are intact, the "volatile" directory can be removed.

为了避免给人一种错误的安全感,易失性挂载的syncfs(和fsync)语义与VFS其余部分的语义略有不同。 如果在易失性挂载完成之后,上层目录(upperdir)所在的文件系统上发生任何回写错误,则所有sync函数都将返回错误。 一旦达到此条件,文件系统将无法恢复,并且每个后续sync调用都将返回错误,即使上层目录自上次sync调用以来没有遇到新错误也是如此。
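The "sticky" error behaviour of sync on a volatile mount can be modelled in a few lines. This Python sketch is a toy state machine, not kernel code: the class `VolatileUpper` and its methods are hypothetical, and the choice of `EIO` as the returned error is an assumption, since the text only says that "an error" is returned.

```python
import errno


class VolatileUpper:
    """Toy model of sync semantics on a volatile overlay mount."""

    def __init__(self) -> None:
        # Set once any writeback error is seen after the volatile mount.
        self._failed = False

    def record_writeback_error(self) -> None:
        self._failed = True

    def sync(self) -> None:
        # The flag is never cleared: once an error has been seen, every
        # later sync fails, even without a new writeback error.
        if self._failed:
            # Assumption: EIO stands in for the unspecified error.
            raise OSError(errno.EIO, "writeback error seen since volatile mount")
```

Unlike the usual report-once behaviour, the flag here is never reset, so an application cannot conclude from a later successful sync that its data is safe.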

当使用“volatile”选项挂载overlay时,将创建目录“$workdir/work/incompat/volatile”。 在下次挂载时,overlay会检查此目录,如果它存在则拒绝挂载。 这强烈表明用户应该丢弃上层目录和工作目录并创建新的目录。 在极少数情况下,如果用户确知系统没有崩溃并且上层目录的内容完好无损,则可以删除“volatile”目录。
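A script that prepares overlay mounts can look for the marker directory before attempting to mount. The following Python sketch builds the path given in the text, "$workdir/work/incompat/volatile"; the function names are hypothetical, and the sketch only detects the marker and never removes it.

```python
from pathlib import Path


def volatile_marker(workdir) -> Path:
    """Path of the directory created by a mount with the "volatile" option."""
    return Path(workdir) / "work" / "incompat" / "volatile"


def left_by_volatile_mount(workdir) -> bool:
    """True if a previous volatile mount left its marker in workdir."""
    return volatile_marker(workdir).is_dir()
```

If `left_by_volatile_mount(workdir)` returns true, the guidance above is to discard the upper and work directories and start fresh, unless it is known that the system did not crash and upperdir is intact.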

用户扩展属性(User xattr)

The "-o userxattr" mount option forces overlayfs to use the "user.overlay." xattr namespace instead of "trusted.overlay.". This is useful for unprivileged mounting of overlayfs.

“-o userxattr”挂载选项强制overlayfs使用“user.overlay.” xattr 命名空间而不是“trusted.overlay.”。 这对于非特权挂载overlayfs 很有用。
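Tools that inspect overlay metadata need to know which namespace a given mount uses. A minimal Python sketch of the name selection follows; the helper `overlay_xattr` is hypothetical, and the attribute suffixes used in the example ("origin", "redirect") are the ones named earlier in this document.

```python
def overlay_xattr(name: str, userxattr: bool = False) -> str:
    """Full xattr name for an overlay attribute.

    userxattr=True models a mount made with "-o userxattr".
    """
    prefix = "user.overlay." if userxattr else "trusted.overlay."
    return prefix + name
```

On Linux the result could be passed to `os.getxattr(path, overlay_xattr("origin"))` against a file in the upper directory. Reading the "trusted." namespace normally requires privileges, which is the reason the "user." variant matters for unprivileged mounts.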

测试套件(Testsuite)

There's a testsuite originally developed by David Howells and currently maintained by Amir Goldstein at:

有一个测试套件最初由 David Howells 开发,目前由 Amir Goldstein 维护,网址为:

https://github.com/amir73il/unionmount-testsuite.git

Run as root:

以Root用户运行:

# cd unionmount-testsuite
# ./run --ov --verify