RAS: Intel MCA-CMCI, how well do you know it?

2023-07-11 11:23:26  Source: Linux閱碼場

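Corrected machine-check error interrupt (CMCI) is an enhancement to MCA that provides threshold-based error reporting. In this mode, software configures a threshold for hardware corrected MC errors; once the number of CEs (Corrected Errors) reaches that threshold, the hardware raises an interrupt so that software can handle it.

It is worth noting that CMCI was added to MCA later; originally, CE information could only be obtained by software polling. The advantage of CMCI interrupt notification is that every CE goes through the IRQ handler, so no CE is lost, whereas polling can miss CEs because the polling frequency is low or the storage space is limited. That does not make CMCI strictly better: its drawback is that a large number of CEs produces an interrupt storm that degrades machine performance. Unfortunately, CE storms are fairly common on cloud servers. How do current Intel servers deal with this? That is covered below.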

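CMCI mechanism

CMCI is disabled by default. IA32_MCG_CAP is a read-only register: software checks IA32_MCG_CAP[10] (MCG_CMCI_P) to confirm that the processor supports CMCI, and then enables it per bank through IA32_MCi_CTL2.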


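Software uses the IA32_MCi_CTL2 MSR to enable or disable CMCI for the corresponding bank (bit 30, CMCI_EN).

The threshold is set through IA32_MCi_CTL2 bits 14:0. If a non-zero value is written, that value is used as the threshold; if CMCI is not supported, the field reads as all zeros.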

[Figure: the CMCI mechanism]

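The hardware compares IA32_MCi_CTL2 bits 14:0 with IA32_MCi_STATUS bits 52:38 (the corrected error count). When the two values are equal, an overflow event is sent to the CMCI LVT entry of the APIC. If the MC error involves several processors, the CMCI interrupt is delivered to all of them at the same time. For example, if a CE occurs in a cache shared by two CPUs, both CPUs receive the CMCI.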
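The comparison described above can be modelled in a few lines of user-space C. This is only a sketch of the rule as stated in the text, not hardware documentation: the helper names (status_with_count, status_corrected_count, cmci_would_fire) are made up for illustration, and treating a zero threshold as "never fire" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field layout taken from the text: CTL2[14:0] and STATUS[52:38]. */
#define CTL2_THRESHOLD_MASK 0x7fffULL
#define STATUS_CEC_SHIFT    38
#define STATUS_CEC_MASK     0x7fffULL

/* Hypothetical helper: return 'status' with its corrected error count replaced. */
uint64_t status_with_count(uint64_t status, uint16_t count)
{
    assert(count <= STATUS_CEC_MASK);   /* the field is only 15 bits wide */
    status &= ~(STATUS_CEC_MASK << STATUS_CEC_SHIFT);
    return status | ((uint64_t)count << STATUS_CEC_SHIFT);
}

/* Extract the 15-bit corrected error count from an IA32_MCi_STATUS value. */
uint16_t status_corrected_count(uint64_t status)
{
    return (uint16_t)((status >> STATUS_CEC_SHIFT) & STATUS_CEC_MASK);
}

/* Hypothetical helper: true when the count equals the programmed threshold. */
bool cmci_would_fire(uint64_t ctl2, uint64_t status)
{
    uint16_t threshold = (uint16_t)(ctl2 & CTL2_THRESHOLD_MASK);

    /* Assumption of this sketch: a zero threshold never signals. */
    if (threshold == 0)
        return false;
    return status_corrected_count(status) == threshold;
}
```

With a threshold of 1, as Linux programs by default, cmci_would_fire() becomes true on the first corrected error recorded in the bank.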

CMCI initialization

Taking Linux v6.3 as an example, the kernel code that enables CMCI is:

    /* arch/x86/kernel/cpu/mce/intel.c */
    void intel_init_cmci(void)
    {
            int banks;

            if (!cmci_supported(&banks))
                    return;

            mce_threshold_vector = intel_threshold_interrupt;
            cmci_discover(banks);
            /*
             * For CPU #0 this runs with still disabled APIC, but that's
             * ok because only the vector is set up. We still do another
             * check for the banks later for CPU #0 just to make sure
             * to not miss any events.
             */
            apic_write(APIC_LVTCMCI, THRESHOLD_APIC_VECTOR|APIC_DM_FIXED);
            cmci_recheck();
    }

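1. cmci_supported() mainly does the following:

- Checks the kernel boot parameters "mce=no_cmci" and "mce=ignore_ce" to decide whether CMCI and CE reporting are turned on

- Checks whether the hardware supports CMCI

- Reads the MCG_CMCI_P bit of IA32_MCG_CAP to determine whether the CMCI capability is present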
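To make the last check concrete, here is a small user-space sketch that decodes a raw IA32_MCG_CAP value. The bit positions (bank count in bits 7:0, MCG_CMCI_P in bit 10) follow the SDM; the struct cmci_caps type, the decode_mcg_cap() name, and the max_banks clamp parameter are illustrative assumptions and not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCG_BANKCNT_MASK 0xffULL       /* IA32_MCG_CAP[7:0]: number of banks */
#define MCG_CMCI_P       (1ULL << 10)  /* IA32_MCG_CAP[10]: CMCI present */

/* Hypothetical container for the two fields this sketch cares about. */
struct cmci_caps {
    unsigned int banks;
    bool cmci_present;
};

/* Decode a raw IA32_MCG_CAP value, clamping the bank count to max_banks. */
struct cmci_caps decode_mcg_cap(uint64_t cap, unsigned int max_banks)
{
    struct cmci_caps caps;
    unsigned int count = (unsigned int)(cap & MCG_BANKCNT_MASK);

    assert(max_banks > 0);
    caps.banks = count < max_banks ? count : max_banks;
    caps.cmci_present = (cap & MCG_CMCI_P) != 0;
    return caps;
}
```

For example, decode_mcg_cap(0x0c1a, 64) reports 26 banks with CMCI present, because the low byte is 0x1a and bit 10 is set.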

2. mce_threshold_vector = intel_threshold_interrupt; registers intel_threshold_interrupt() as the CMCI interrupt handler.

3. cmci_discover() mainly does the following:

- Iterates over all banks and enables CMCI on each of them by programming the IA32_MCi_CTL2 register:

    rdmsrl(MSR_IA32_MCx_CTL2(i), val);
    ...
    val |= MCI_CTL2_CMCI_EN;
    wrmsrl(MSR_IA32_MCx_CTL2(i), val);
    rdmsrl(MSR_IA32_MCx_CTL2(i), val);

- Sets the CMCI threshold, with the following code:

    #define CMCI_THRESHOLD 1

    if (!mca_cfg.bios_cmci_threshold) {
            val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
            val |= CMCI_THRESHOLD;
    } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
            /*
             * If bios_cmci_threshold boot option was specified
             * but the threshold is zero, we'll try to initialize
             * it to 1.
             */
            bios_zero_thresh = 1;
            val |= CMCI_THRESHOLD;
    }

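If the user did not pass the boot parameter "mce=bios_cmci_threshold", the threshold field of val is cleared and set to CMCI_THRESHOLD, which is 1.

If "mce=bios_cmci_threshold" was passed, the kernel assumes the BIOS has already configured a threshold. When val & MCI_CTL2_CMCI_THRESHOLD_MASK is non-zero, the else-if branch is skipped and the BIOS value is kept. When the BIOS did not configure a value, val & MCI_CTL2_CMCI_THRESHOLD_MASK is 0 and the driver initializes the threshold to 1.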
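The two cases can be exercised in isolation with the sketch below. It mirrors the branch structure of the kernel snippet as a pure function; the function name cmci_ctl2_value() and the idea of passing the boot option as a bool parameter are assumptions made for illustration (the kernel reads mca_cfg.bios_cmci_threshold and the MSR directly). The constant values match the kernel's definitions of MCI_CTL2_CMCI_EN (bit 30) and MCI_CTL2_CMCI_THRESHOLD_MASK (0x7fff).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCI_CTL2_CMCI_EN             (1ULL << 30)
#define MCI_CTL2_CMCI_THRESHOLD_MASK 0x7fffULL
#define CMCI_THRESHOLD               1

/*
 * Hypothetical pure-function version of the threshold logic:
 * 'val' is the value read from IA32_MCi_CTL2, and the result is
 * the value that would be written back.
 */
uint64_t cmci_ctl2_value(uint64_t val, bool bios_cmci_threshold)
{
    val |= MCI_CTL2_CMCI_EN;

    if (!bios_cmci_threshold) {
        /* No boot option: always override with the kernel default. */
        val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
        val |= CMCI_THRESHOLD;
    } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
        /* Boot option given but the BIOS left the field at zero. */
        val |= CMCI_THRESHOLD;
    }

    /* Every path leaves a non-zero threshold behind. */
    assert((val & MCI_CTL2_CMCI_THRESHOLD_MASK) != 0);
    return val;
}
```

A BIOS-programmed threshold of 0x10 therefore survives only when the boot option is set; without it the field is reset to 1.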

4. cmci_recheck()

cmci_recheck() calls machine_check_poll() to check whether CPU #0 has missed any CE or UCE events.

CMCI handling

The CMCI interrupt handler is intel_threshold_interrupt(), defined in arch/x86/kernel/cpu/mce/intel.c:

    /*
     * The interrupt handler. This is called on every event.
     * Just call the poller directly to log any events.
     * This could in theory increase the threshold under high load,
     * but doesn't for now.
     */
    static void intel_threshold_interrupt(void)
    {
            if (cmci_storm_detect())
                    return;

            machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
    }

1. cmci_storm_detect() handles CMCI storms. The code is as follows:

    static bool cmci_storm_detect(void)
    {
            unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
            unsigned long ts = __this_cpu_read(cmci_time_stamp);
            unsigned long now = jiffies;
            int r;

            if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
                    return true;

            if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
                    cnt++;
            } else {
                    cnt = 1;
                    __this_cpu_write(cmci_time_stamp, now);
            }
            __this_cpu_write(cmci_storm_cnt, cnt);

            if (cnt <= CMCI_STORM_THRESHOLD)
                    return false;

            cmci_toggle_interrupt_mode(false);
            __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
            r = atomic_add_return(1, &cmci_storm_on_cpus);
            mce_timer_kick(CMCI_STORM_INTERVAL);
            this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);

            if (r == 1)
                    pr_notice("CMCI storm detected: switching to poll mode\n");
            return true;
    }

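Using jiffies, this function checks whether the number of CMCIs seen within a fixed interval (CMCI_STORM_INTERVAL) exceeds CMCI_STORM_THRESHOLD (15). If it does not, the function returns false and the interrupt is handled normally. Otherwise a CMCI storm is in progress: the function calls cmci_toggle_interrupt_mode() to disable CMCI and switches to poll mode, in which events are collected by polling. This is how the kernel addresses the interrupt-storm problem raised at the beginning.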
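The counting logic is easy to try out in user space. The following sketch keeps only the window-and-counter part of cmci_storm_detect(); struct storm_state, storm_detect(), the explicit 'now' argument, and the STORM_INTERVAL value of 100 ticks are all assumptions made for illustration, and the steps that actually disable CMCI and start the poll timer are reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STORM_INTERVAL  100UL  /* assumed window length, in simulated ticks */
#define STORM_THRESHOLD 15U    /* CMCI_STORM_THRESHOLD from the text */

/* Hypothetical stand-in for the per-CPU variables used by the kernel. */
struct storm_state {
    unsigned int cnt;            /* interrupts seen in the current window */
    unsigned long window_start;  /* tick at which the window began */
    bool storm_active;           /* true once poll mode has been entered */
};

/* Returns true when the caller should skip interrupt-driven handling. */
bool storm_detect(struct storm_state *s, unsigned long now)
{
    assert(s != NULL);

    if (s->storm_active)
        return true;

    /* Unsigned subtraction keeps the comparison valid across wraparound. */
    if (now - s->window_start <= STORM_INTERVAL) {
        s->cnt++;
    } else {
        s->cnt = 1;
        s->window_start = now;
    }

    if (s->cnt <= STORM_THRESHOLD)
        return false;

    /* The kernel disables CMCI and kicks the poll timer at this point. */
    s->storm_active = true;
    return true;
}
```

Feeding the function sixteen interrupts inside one window trips the detector on the sixteenth call, while interrupts spaced further apart than the window never do.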

2. When there is no CMCI storm, machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned)) retrieves and logs the error information.

The first parameter is defined as follows; MCP_TIMESTAMP means the current TSC is recorded:

    enum mcp_flags {
            MCP_TIMESTAMP   = BIT(0),       /* log time stamp */
            MCP_UC          = BIT(1),       /* log uncorrected errors */
            MCP_DONTLOG     = BIT(2),       /* only clear, don't log */
    };

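machine_check_poll() reads the IA32_MCG_STATUS and IA32_MCi_STATUS registers together with related CPU information, classifies the error, and logs CE events or other event types to /dev/mcelog. Users can read /dev/mcelog to retrieve the error records.

The execution flow is shown below, with explanations in the code comments: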

    bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
    {
            if (flags & MCP_TIMESTAMP)
                    m.tsc = rdtsc(); /* record the current TSC */

    /* Logging of CE errors */
                    /* If this entry is not valid, ignore it */
                    if (!(m.status & MCI_STATUS_VAL))
                            continue;

                    /*
                     * If we are logging everything (at CPU online) or this
                     * is a corrected error, then we must log it.
                     */
                    if ((flags & MCP_UC) || !(m.status & MCI_STATUS_UC))
                            goto log_it;

    /* Logging of UCNA errors */
                    /*
                     * Log UCNA (SDM: 15.6.3 "UCR Error Classification")
                     * UC == 1 && PCC == 0 && S == 0
                     */
                    if (!(m.status & MCI_STATUS_PCC) && !(m.status & MCI_STATUS_S))
                            goto log_it;

    /* Record the error information through mce_log */
    log_it:
                    /*
                     * Don't get the IP here because it's unlikely to
                     * have anything to do with the actual error location.
                     */
                    if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
                            mce_log(&m);
                    else if (mce_usable_address(&m)) {
                            /*
                             * Although we skipped logging this, we still want
                             * to take action. Add to the pool so the registered
                             * notifiers will see it.
                             */
                            if (!mce_gen_pool_add(&m))
                                    mce_schedule_work();
                    }
            }
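The classification in the excerpt boils down to a handful of bit tests on IA32_MCi_STATUS. The sketch below reproduces only those tests as a pure function so they can be tried in user space. The enum poll_verdict type and the poll_classify() name are invented for illustration; the status bit positions (VAL 63, UC 61, PCC 57, S 56) follow the SDM, and the sketch covers only the branches visible in the excerpt, so it is not a substitute for the full kernel logic.

```c
#include <assert.h>
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_UC  (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56)  /* signaled via machine check */

#define MCP_UC         (1U << 1)     /* same value as in enum mcp_flags */

/* Hypothetical result type for this sketch. */
enum poll_verdict {
    POLL_IGNORE,    /* not valid, or left for the #MC handler */
    POLL_LOG_CE,    /* corrected error */
    POLL_LOG_UCNA,  /* UC == 1, PCC == 0, S == 0 */
    POLL_LOG_ANY    /* MCP_UC set: log whatever is in the bank */
};

enum poll_verdict poll_classify(unsigned int flags, uint64_t status)
{
    /* Only the three mcp_flags bits from the text are meaningful. */
    assert((flags & ~0x7U) == 0);

    if (!(status & MCI_STATUS_VAL))
        return POLL_IGNORE;
    if (!(status & MCI_STATUS_UC))
        return POLL_LOG_CE;
    if (flags & MCP_UC)
        return POLL_LOG_ANY;
    if (!(status & MCI_STATUS_PCC) && !(status & MCI_STATUS_S))
        return POLL_LOG_UCNA;
    return POLL_IGNORE;
}
```

On the CMCI path the flags argument is MCP_TIMESTAMP alone, so a valid bank is logged either as a CE or as a UCNA, and more severe uncorrected errors are left untouched.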

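To summarize, CMCI is an enhancement to MCA whose main purpose is to report hardware errors such as CE and UCNA to software through an interrupt. On receiving the interrupt, software runs the handler intel_threshold_interrupt(), which records the error information to /dev/mcelog in either IRQ mode or poll mode. User space can then read /dev/mcelog to obtain the hardware error information.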

Reference: Intel® 64 and IA-32 Architectures Software Developer's Manual
