
Effect of training on resident inter-reader agreement with American College of Radiology Thyroid Imaging Reporting and Data System

2022-02-12 Yang Du, Meredith Bara, Prayash Katlariwala, Roger Croutze, Katrin Resch, Jonathan Porter, Medica Sam, Mitchell Wilson, Gavin Low

World Journal of Radiology, 2022, Issue 1

INTRODUCTION

Thyroid nodules are detected in more than 50% of healthy individuals, with approximately 95% representing asymptomatic incidental nodules[1-3]. Moreover, an increasing number of thyroid nodules have been detected in recent years on account of the improved quality and increased frequency of medical imaging[4]. Although most thyroid nodules are benign and do not require treatment, adequate characterization is necessary to identify potentially malignant nodules[1-3]. The American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) was therefore introduced to standardize the ultrasound characterization of thyroid nodules based on 5 morphologic categories (composition, echogenicity, shape, margins, and echogenic foci). A TI-RADS score is obtained to represent the level of suspicion for cancer and to direct the need for follow-up and/or tissue sampling[5]. First published in 2017, ACR TI-RADS has been widely adopted by many centers worldwide. Studies have shown that ACR TI-RADS reduces unnecessary biopsies and improves the consistency of imaging recommendations[6,7].

Despite its widespread adoption, few studies to date have assessed the inter-reader reliability of TI-RADS amongst radiology trainees with limited ultrasound experience. A single-institution study performed in China by Teng et al[8] evaluated three trainees with less than three months of ultrasound experience, demonstrating fair to almost perfect agreement amongst readers for TI-RADS categorization, with improved agreement and diagnostic accuracy after training. To our knowledge, no similar inter-reader agreement studies have been performed in North American trainees. The purpose of this study is to evaluate the inter-reader reliability amongst radiology trainees before and after designated TI-RADS training in a North American institution.

MATERIALS AND METHODS

This retrospective, single-institution observational study was approved by the institutional Health Research Ethics Board (Pro 00104708). This study was exempted from obtaining informed consent. A retrospective review of the local Picture Archiving and Communication System (PACS) was performed to identify thyroid ultrasound studies containing thyroid nodules between July 1, 2019 and July 31, 2020. Included cases required at least 1 thyroid nodule (minimal dimension of 5 mm) with both transverse and sagittal still images and a cine video recording in at least 1 plane. Nodules with non-diagnostic image quality, incomplete nodule visualization, or absence of a cine clip covering the entirety of the nodule were excluded. The ultrasound make, model, and platform were not considered in the selection process.

Eighty consecutive thyroid nodules meeting eligibility criteria were selected by 2 authors (YD, 6 years clinical experience; MB, 3 years clinical experience) from the eligible ultrasound examinations. A single case could include more than one nodule if sufficient imaging was available to meet inclusion criteria for multiple nodules. Still images of each nodule in both transverse and sagittal planes, as well as at least 1 cine video clip of the nodule, were saved in a teaching file hosted on our institutional PACS. Each nodule and its representative images/cine clips were saved separately. If a single patient had two nodules, the relevant images and cine clips for each nodule were saved as separate case numbers. Of these, 50 cases were allocated into the “testing” group and 30 cases into the “training” group. Non-random group selection was performed to allow an approximately even distribution of TI-RADS categories within each group and to prevent under-representation of any category. A steering committee consisting of 2 authors including the principal investigator (YD, MB) attempted to divide cases of differing difficulty evenly between the “testing” and “training” groups. This variable approach was selected over a pathological gold standard in an attempt to reduce referral bias in the “testing” group, a situation likely encountered by Teng et al[8], where 61% (245/400) of included nodules were pathologically malignant. The trainees were blinded to the distribution approach of the “testing” group.

All patient identifiers were removed apart from age and gender. All cases were evaluated by a consensus review of 3 independent fellowship-trained, board-certified staff radiologists with between 1 and 14 years of clinical experience each (GL, MW, MS). Any disagreement on the scoring of nodules for the ACR TI-RADS level was resolved by re-review and consensus discussion. Findings on the consensus review were recorded and set as the standard of reference. This approach has been used in other recent inter-reader reliability studies assessing ACR Reporting and Data Systems[9].

Three PGY-4 radiology residents (trainees) were selected as blinded readers for this study. Each trainee had between 4 and 5 mo of designated ultrasound training, in addition to non-designated ultrasound training on other rotations throughout their training. No trainee had received specialized TI-RADS training prior to this study. Each of the readers independently reviewed the 50 testing cases and assigned a TI-RADS score to each case. The readers were provided with a summary chart detailing the ACR TI-RADS classification as described in the ACR TI-RADS White Paper and had access to an online TI-RADS calculator (https://tiradscalculator.com) at the time of independent review[5]. The readers were instructed to assign TI-RADS points for each category (composition, echogenicity, shape, margins, and echogenic foci) and to determine the TI-RADS level and ACR TI-RADS recommendations. The pre-training responses were entered into an online survey generated via Google Forms. Four weeks after the readers had completed the pre-training assessment, a one-hour teaching session including a Microsoft PowerPoint presentation illustrating important features of ACR TI-RADS was provided to the readers, along with a Microsoft Word document summarizing common areas of disagreement in nodule characterization[5]. The teaching session provided a step-by-step review of the 5 main sonographic features used for nodule scoring in ACR TI-RADS: (1) Composition; (2) Echogenicity; (3) Shape; (4) Margin; and (5) Echogenic foci. The description and interpretation of each feature were discussed and illustrated by examples. The readers were given ample opportunity to ask questions, and the consensus panel provided focused clarification in areas of reader uncertainty. Additionally, the trainees were instructed to review the training file that contained the 30 training cases on PACS, and corresponding answers were provided for each case. Two weeks after the training session (six weeks after the pre-training assessment), the 50 anonymized cases from the “testing” group were re-sent to the readers for independent review. Readers were instructed to re-score the 50 cases, and the post-training responses were entered into an online survey generated via Google Forms.
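The scoring step the readers performed (summing feature points, then deriving a TI-RADS level and a recommendation) can be sketched as below. This is an illustrative sketch only, not the calculator used in the study. The function names are hypothetical, and the point cut-offs and size thresholds are written from the published ACR TI-RADS White Paper[5] as commonly summarized, so they should be checked against that source before any use.

```python
def tirads_level(total_points: int) -> int:
    """Map a total ACR TI-RADS point score to a level (1-5).

    Assumed cut-offs: 0 points -> TR1, 2 -> TR2, 3 -> TR3, 4-6 -> TR4, >=7 -> TR5.
    A total of 1 point is treated as TR1 here (an assumption of this sketch).
    """
    if total_points < 0:
        raise ValueError("total_points must be non-negative")
    if total_points <= 1:
        return 1
    if total_points == 2:
        return 2
    if total_points == 3:
        return 3
    if total_points <= 6:
        return 4
    return 5


# Assumed size thresholds in cm per level: (FNA threshold, follow-up threshold).
SIZE_THRESHOLDS_CM = {3: (2.5, 1.5), 4: (1.5, 1.0), 5: (1.0, 0.5)}


def recommendation(level: int, max_diameter_cm: float) -> str:
    """Return the management category for a nodule of a given level and size."""
    if level not in SIZE_THRESHOLDS_CM:
        return "no FNA or follow-up"  # TR1 and TR2
    fna_cm, follow_cm = SIZE_THRESHOLDS_CM[level]
    if max_diameter_cm >= fna_cm:
        return "FNA"
    if max_diameter_cm >= follow_cm:
        return "follow-up"
    return "no FNA or follow-up"


# Hypothetical nodule: points per feature as assigned by a reader.
points = {"composition": 2, "echogenicity": 2, "shape": 0, "margins": 0, "echogenic_foci": 0}
level = tirads_level(sum(points.values()))
print(level, recommendation(level, max_diameter_cm=1.2))
```

Because the level depends only on the point total, a one-step disagreement on a single feature (for example, margins scored +0 versus +2) can move a nodule across a level boundary and change the recommendation.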

Statistical analysis


Fleiss kappa (overall agreement) was used to calculate the pooled inter-reader agreement. The kappa (κ) value interpretation as suggested by Cohen was used: ≤ 0.20 (slight agreement), 0.21-0.40 (fair agreement), 0.41-0.60 (moderate agreement), 0.61-0.80 (substantial agreement), and 0.81-1.00 (almost perfect agreement)[10].
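For readers unfamiliar with the statistic, the following is a minimal sketch of the standard Fleiss kappa formula together with the interpretation bands quoted above. It is not the authors' analysis code: the function names and the toy ratings are hypothetical, and how the study pooled reader and reference ratings is not specified in the text.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss kappa for a list of cases, each a list with one label per rater.

    Every case must be rated by the same number of raters (at least 2).
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(case) for case in ratings]

    # Proportion of all assignments that fall into each category.
    p_cat = [sum(c[cat] for c in counts) / (n_cases * n_raters) for cat in categories]

    # Observed agreement per case: fraction of rater pairs that agree.
    p_case = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]

    p_observed = sum(p_case) / n_cases
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


def interpret_kappa(kappa):
    """Agreement bands as listed in the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy data: 4 nodules, 3 readers, TI-RADS levels restricted to 1 and 2.
toy = [[1, 1, 1], [1, 1, 2], [2, 2, 2], [2, 2, 1]]
k = fleiss_kappa(toy, categories=[1, 2])
print(round(k, 3), interpret_kappa(k))
```

In the toy data two of the four nodules have full agreement and two have a single dissenting reader, which gives an observed agreement of 2/3 against a chance agreement of 1/2.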

A paired t-test was used to evaluate for a significant difference between agreement coefficients[11].

Using the consensus panel as the reference standard, the relative diagnostic parameters (sensitivity, specificity, positive predictive value and negative predictive value) per TI-RADS level were calculated for individual readers and on a pooled basis.
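A per-level calculation of this kind is usually done one-vs-rest: for a given TI-RADS level, a reader's call counts as positive if it equals that level and negative otherwise. The sketch below illustrates that approach under the assumption that this is how the per-level parameters were derived; the function name and toy data are hypothetical.

```python
def one_vs_rest_metrics(reference, reader, level):
    """Sensitivity, specificity, PPV and NPV for one TI-RADS level.

    `reference` and `reader` are equal-length sequences of assigned levels.
    A ratio with a zero denominator is returned as None.
    """
    tp = fp = fn = tn = 0
    for ref, obs in zip(reference, reader):
        if obs == level and ref == level:
            tp += 1
        elif obs == level:
            fp += 1
        elif ref == level:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data: consensus reference levels and one reader's levels for 8 nodules.
reference = [1, 1, 2, 3, 4, 4, 5, 5]
reader = [1, 2, 2, 3, 4, 5, 5, 4]
m = one_vs_rest_metrics(reference, reader, level=4)
print(m)
```

One simple way to obtain pooled values is to concatenate the reference and reader lists across all readers before calling the function, although the pooling method used in the study is not stated.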

RESULTS

The testing cases comprised 50 nodules in 40 patients. There were 33 (82.5%) females and 7 males. The mean patient age was 56.6 ± 13.6 years, with an age range from 29 to 80 years. Of the 50 nodules, 31 (62%) were located in the right lobe, 18 (36%) in the left lobe and 1 (2%) in the isthmus. The mean nodule size was 19 ± 14 mm, with a range from 5 to 63 mm. According to the reference standard, which consisted of a consensus panel of 3 fellowship-trained staff radiologists, there were 11 (22%) TI-RADS level 1 nodules, 9 (18%) TI-RADS level 2 nodules, 9 (18%) TI-RADS level 3 nodules, 13 (26%) TI-RADS level 4 nodules, and 8 (16%) TI-RADS level 5 nodules.

Overall, the current study demonstrates a statistically significant improvement in agreement with expert consensus among radiology residents with no prior ACR TI-RADS experience, in the assignment of both TI-RADS level and recommendations, after a single didactic teaching session. Our study demonstrates the learnability of the ACR TI-RADS system and supports the use of dedicated training in radiology residents. Future studies could evaluate the effect of additional training sessions focused on features demonstrating lower inter-reader agreement, such as “margins”, as well as the retention of training over time.

The pooled inter-reader agreement with the reference standard, pre- and post-training, is listed in Table 1. A statistically significant improvement in post-training inter-reader agreement was demonstrated for nodule shape (P < 0.001), presence of echogenic foci (P = 0.004), TI-RADS level (P < 0.001) and overall recommendation (P = 0.02). Each of these improved by at least one category of agreement. Only margin characterization remained at slight agreement after training. Similarly, the percentage reader agreement with the reference standard for sonographic features (Table 2), TI-RADS levels (Table 3) and recommendations (Table 4) is also included. Figure 1 provides an illustrated example of complete reader concordance for nodule scoring using ACR TI-RADS. In contrast, Figure 2 provides an illustrated example where there is discordance in reader scoring using ACR TI-RADS.

Categorical variables were expressed as values and percentages. Continuous variables were expressed as the mean ± SD. The following statistical tests were used:

The overall inter-reader agreement for ACR TI-RADS should take into account the inter-reader agreement of its two major outcome variables: ‘TI-RADS level’ and ‘recommendations’. In our study, the inter-reader agreement for ‘TI-RADS level’ showed a significant improvement with training (κ = 0.14 (slight) on the pre-training assessment vs κ = 0.36 (fair) on the post-training assessment)[12]. Our inter-reader agreement for ‘recommendations’ also showed a significant improvement with training (κ = 0.36 (fair) on the pre-training assessment vs κ = 0.50 (moderate) on the post-training assessment [P = 0.02]). Our findings suggest that even a single didactic training session can significantly improve the overall inter-reader agreement in radiology residents.

Our findings compare favorably with other inter-reader agreement studies involving ACR TI-RADS. A study by Hoang et al[7] involving 8 board-certified radiologists (2 from academic centers with subspecialty training in US and 6 from private practice with no subspecialty training in US) found a fair (κ = 0.35) inter-reader agreement for ‘TI-RADS level’, and moderate (κ = 0.51) inter-reader agreement for ‘recommendations’[7]. Teng et al[8] assessed the learnability and reproducibility of ACR TI-RADS in post-graduate freshmen. The study included 3 readers with < 3 mo ultrasound experience and 3 experts with > 15 years ultrasound experience each. The readers independently evaluated 4 groups of nodules with 50 nodules per group. After evaluating each group, a post-group training session was carried out for the freshmen. The study found that the inter-reader agreement improved with training. Chung et al[13] performed a study evaluating the impact of radiologist experience on ACR TI-RADS. Six fellowship-trained radiologists were divided into two groups (experienced vs less experienced), with the experienced group having at least 20 years of post-fellowship experience each and the less experienced group having 1 year or less of post-fellowship experience each. The study found no significant differences in inter-reader agreement between experienced vs less experienced readers for ‘TI-RADS level’ or ‘recommendations’. The inter-reader agreement was moderate for both experienced and less experienced groups for ‘TI-RADS level’ and moderate to substantial (experienced vs less experienced, respectively) for ‘recommendations’. Seifert et al[14] evaluated the inter-reader agreement and efficacy of consensus reading for several thyroid imaging risk stratification systems including ACR TI-RADS. The study involved 4 experienced specialist readers with more than 5 years of clinical experience each. The readers independently scored 40 thyroid image datasets in session 1, followed by a joint consensus read (C1). After this, the process was repeated with independent scoring of 40 new image datasets in session 2, followed by another consensus read (C2). For ACR TI-RADS, the study found a significantly higher inter-reader agreement for session 2 (κ = 0.57, moderate) vs session 1 (κ = 0.32, fair) [P < 0.01], indicating that the addition of a consensus read improved the inter-reader agreement.

DISCUSSION

Finally, the relative diagnostic performance of readers, pre- and post-training, when compared against the reference standard is included in Table 5 and Table 6, respectively. Pre-training pooled sensitivities ranged from 22.3%-66.7% and pooled specificity ranged from 72.2%-95.1%, depending on TI-RADS category. Post-training pooled sensitivities ranged from 40.7%-63% and pooled specificity ranged from 76.6%-96.8%, depending on TI-RADS category.

Our study also evaluated the inter-reader agreement of individual sonographic features including composition, echogenicity, shape, margins, and echogenic foci. Our findings showed a significant improvement in inter-reader agreement with training for ‘shape’ (κ = 0.09, slight, pre-training versus κ = 0.67, substantial, post-training; P < 0.001) and ‘echogenic foci’ (κ = 0.28, fair, pre-training versus κ = 0.45, moderate, post-training; P = 0.004) but not for the other features. The features with the strongest inter-reader agreement in our study were ‘shape’ (κ = 0.67, substantial) and ‘composition’ (κ = 0.52, moderate). Hoang et al[7] reported similar findings, with ‘shape’ (κ = 0.61, substantial) and ‘composition’ (κ = 0.58, moderate) having the strongest inter-reader agreement amongst the 5 principal sonographic features. The feature with the poorest inter-reader agreement in our study was ‘margins’ (κ = 0.05, slight). Similarly, Hoang et al[7] found that ‘margins’ had the poorest inter-reader agreement (κ = 0.25, fair) in their study. The poor inter-reader agreement for ‘margins’ is not surprising, as accurate assessment requires a thorough review of the entire cine clip rather than review of the still images only. Margins may also be harder to interpret through ultrasound artifacts. Finally, two of the available answer options for ‘margins’ in ACR TI-RADS are ‘ill defined’ (TI-RADS +0 points) and ‘irregular’ (TI-RADS +2 points). These two options are conceptually similar, which can lead to overlap in interpretation. The features with the poorest and strongest inter-reader agreement in our study matched those identified among Hoang’s board-certified radiologists, indicating that the limitation may be inherent to the reporting and data system rather than to trainee experience.

Thyroid nodules are common and often incidental. The American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) standardizes the use of ultrasound for thyroid nodule risk stratification.

The current study has several limitations. One limitation is the lack of a pathological reference standard. The reference standard was an expert consensus review by 3 board-certified radiologists with Body Imaging fellowship training and 1-14 years of clinical experience. However, it should be noted that this study was designed primarily to evaluate the inter-reader reliability of radiology residents, and not the inherent performance of ACR TI-RADS itself. As such, an expert consensus panel was deemed a practical reference standard, and one that simulates ‘real world’ clinical practice[9]. Another limitation is the relatively small number of cases used. However, even with this limited number of cases, we were able to show statistically significant improvements in inter-reader agreement for the two major outcome variables (TI-RADS level and ACR TI-RADS recommendations). While there is a relatively even distribution of TI-RADS levels among the test cases via non-random selection, there is an uneven distribution of individual ultrasound features within the group. Of the 50 test cases, only 3 nodules demonstrated ‘lobulated or irregular’ margins (TI-RADS points +2), while the remaining 47 were ‘smooth’ or ‘ill-defined’ (TI-RADS points +0). A larger sample size could improve this and lead to a more representative analysis of individual ultrasound features. Finally, training retention over time was not evaluated in this study, as the post-training testing was performed two weeks after the didactic session and training case review.

CONCLUSION


ARTICLE HIGHLIGHTS

Research background

We also evaluated the relative sensitivity and specificity of the radiology residents in assigning TI-RADS levels compared to the consensus reference standard before and after training. There was a general trend towards improved pooled sensitivity for TI-RADS levels 1 to 4 on the post-training assessment, while the pooled specificity was relatively high (76.6%-96.8%) for all TI-RADS levels. Overall, the findings suggest that a single didactic training session improves the detection of benign (TI-RADS 1-3) lesions while retaining high specificity in radiology residents. Improved identification of benign lesions is critical in avoiding unnecessary biopsies and interventions, a major aim of the ACR TI-RADS system.

Research motivation

Despite the widespread usage of this system, the learnability of TI-RADS has not been proven in radiology trainees.

Research objectives

To evaluate the inter-reader reliability amongst radiology trainees before and after TI-RADS training.

Research methods

Three PGY-4 radiology residents were evaluated for inter-reader reliability with a 50 thyroid nodule data set before and after a 1-hour didactic teaching session and review of a training data set, with assessments performed 6 wk apart. Performance was compared to a consensus panel reference standard of three fellowship-trained radiologists.


Research results

After one session of dedicated TI-RADS training, the radiology residents demonstrated a statistically significant improvement in inter-reader agreement for "shape", "echogenic foci", "TI-RADS level", and "recommendations" when compared with expert panel consensus. A trend towards higher pooled sensitivity for TI-RADS levels 1-4 was also observed.


Research conclusions

Resident trainees demonstrated a statistically significant improvement in inter-reader agreement for both TI-RADS level and recommendations after training. This study demonstrates the learnability of the ACR TI-RADS.

Research perspectives

A multi-institutional and multi-national assessment of radiology resident diagnostic accuracy and inter-reader reliability of ACR TI-RADS classification and recommendations before and after training would improve the generalizability of these results.

1 Gharib H, Papini E, Garber JR, Duick DS, Harrell RM, Hegedüs L, Paschke R, Valcavi R, Vitti P; AACE/ACE/AME Task Force on Thyroid Nodules. American Association of Clinical Endocrinologists, American College of Endocrinology, and Associazione Medici Endocrinologi medical guidelines for clinical practice for the diagnosis and management of thyroid nodules--2016 update. Endocr Pract 2016; 22: 622-639 [PMID: 27167915 DOI: 10.4158/EP161208.GL]

2 Grani G, Lamartina L, Cantisani V, Maranghi M, Lucia P, Durante C. Interobserver agreement of various thyroid imaging reporting and data systems. Endocr Connect 2018; 7: 1-7 [PMID: 29196301 DOI: 10.1530/EC-17-0336]

3 Smith-Bindman R, Lebda P, Feldstein VA, Sellami D, Goldstein RB, Brasic N, Jin C, Kornak J. Risk of thyroid cancer based on thyroid ultrasound imaging characteristics: results of a population-based study. JAMA Intern Med 2013; 173: 1788-1796 [PMID: 23978950 DOI: 10.1001/jamainternmed.2013.9245]

4 Lim H, Devesa SS, Sosa JA, Check D, Kitahara CM. Trends in Thyroid Cancer Incidence and Mortality in the United States, 1974-2013. JAMA 2017; 317: 1338-1348 [PMID: 28362912 DOI: 10.1001/jama.2017.2719]

5 Tessler FN, Middleton WD, Grant EG, Hoang JK, Berland LL, Teefey SA, Cronan JJ, Beland MD, Desser TS, Frates MC, Hammers LW, Hamper UM, Langer JE, Reading CC, Scoutt LM, Stavros AT. ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee. J Am Coll Radiol 2017; 14: 587-595 [PMID: 28372962 DOI: 10.1016/j.jacr.2017.01.046]

6 Ha EJ, Na DG, Baek JH, Sung JY, Kim JH, Kang SY. US Fine-Needle Aspiration Biopsy for Thyroid Malignancy: Diagnostic Performance of Seven Society Guidelines Applied to 2000 Thyroid Nodules. Radiology 2018; 287: 893-900 [PMID: 29465333 DOI: 10.1148/radiol.2018171074]

7 Hoang JK, Middleton WD, Farjat AE, Teefey SA, Abinanti N, Boschini FJ, Bronner AJ, Dahiya N, Hertzberg BS, Newman JR, Scanga D, Vogler RC, Tessler FN. Interobserver Variability of Sonographic Features Used in the American College of Radiology Thyroid Imaging Reporting and Data System. AJR Am J Roentgenol 2018; 211: 162-167 [PMID: 29702015 DOI: 10.2214/AJR.17.19192]

8 Teng D, Fu P, Li W, Guo F, Wang H. Learnability and reproducibility of ACR Thyroid Imaging, Reporting and Data System (TI-RADS) in postgraduate freshmen. Endocrine 2020; 67: 643-650 [PMID: 31919768 DOI: 10.1007/s12020-019-02161-y]

9 Pi Y, Wilson MP, Katlariwala P, Sam M, Ackerman T, Paskar L, Patel V, Low G. Diagnostic accuracy and inter-observer reliability of the O-RADS scoring system among staff radiologists in a North American academic clinical setting. Abdom Radiol (NY) 2021; 46: 4967-4973 [PMID: 34185128 DOI: 10.1007/s00261-021-03193-7]

10 McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012; 22: 276-282 [PMID: 23092060 DOI: 10.11613/BM.2012.031]

11 Gwet KL. Testing the Difference of Correlated Agreement Coefficients for Statistical Significance. Educ Psychol Meas 2016; 76: 609-637 [PMID: 29795880 DOI: 10.1177/0013164415596420]

12 Li W, Wang Y, Wen J, Zhang L, Sun Y. Diagnostic Performance of American College of Radiology TI-RADS: A Systematic Review and Meta-Analysis. AJR Am J Roentgenol 2021; 216: 38-47 [PMID: 32603229 DOI: 10.2214/AJR.19.22691]

13 Chung R, Rosenkrantz AB, Bennett GL, Dane B, Jacobs JE, Slywotzky C, Smereka PN, Tong A, Sheth S. Interreader Concordance of the TI-RADS: Impact of Radiologist Experience. AJR Am J Roentgenol 2020; 214: 1152-1157 [PMID: 32097031 DOI: 10.2214/AJR.19.21913]

14 Seifert P, Görges R, Zimny M, Kreissl MC, Schenke S. Interobserver agreement and efficacy of consensus reading in Kwak-, EU-, and ACR-thyroid imaging recording and data systems and ATA guidelines for the ultrasound risk stratification of thyroid nodules. Endocrine 2020; 67: 143-154 [PMID: 31741167 DOI: 10.1007/s12020-019-02134-1]