Embodied theories propose that language is understood via mental simulations of sensory states related to perception and action. Given that direct speech (e.g., She says, “It’s a lovely day!”) is perceived to be more vivid than indirect speech (e.g., She says (that) it’s a lovely day) in perception, recent research shows in silent reading that more vivid speech representations are mentally simulated for direct speech than for indirect speech. This ‘simulated’ speech is found to contain suprasegmental prosodic representations (e.g., speech prosody) but its phonological detail and its causal role in silent reading of direct speech remain unclear. Here in three experiments, I explored the phonological aspect and the causal role of speech simulations in silent reading of tongue twisters in direct speech, indirect speech and non-speech sentences. The results demonstrated greater visual tongue-twister effects (phonemic interference) during silent reading (Experiment 1) but not oral reading (Experiment 2) of direct speech as compared to indirect speech and non-speech. The tongue-twister effects in silent reading of direct speech were selectively disrupted by phonological interference (concurrent articulation) as compared to manual interference (finger tapping) (Experiment 3). The results replicated more vivid speech simulations in silent reading of direct speech, and additionally extended them to the phonological dimension. Crucially, they demonstrated a causal role of phonological simulations in silent reading of direct speech, at least in tongue-twister reading. The findings are discussed in relation to multidimensionality and task dependence of mental simulation and its mechanisms.