[GH-ISSUE #7644] [FR] Local inference for mobile app using llama.cpp #3386

Open
opened 2026-03-23 21:29:44 +00:00 by mirror · 2 comments

Originally created by @rampa3 on GitHub (Mar 29, 2025).
Original GitHub issue: https://github.com/AppFlowy-IO/AppFlowy/issues/7644

Description

I would like to suggest implementing an option for local LLM inference on mobile devices using the llama.cpp library, with a quantized GGUF model that is either user-provided or downloaded by the app. I believe such a feature is feasible: most mid-tier phones today can run a Q4_K_M quantization (the balanced quality/speed option) of a 7B model at speeds that are slower than on a PC, but still bearable.
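As a rough sketch of what the loading path could look like on the app side, assuming llama.cpp is compiled for the target platform and its C API is exposed to Swift (the `LocalLLM` wrapper is hypothetical, and the function names follow llama.cpp's current C API, which may drift between versions):

```swift
import Foundation
// Assumes llama.cpp is built for the target (e.g. as an XCFramework or via
// its Swift package) and its C API is visible as the `llama` module.
import llama

/// Hypothetical wrapper: loads a user-provided GGUF file and keeps the
/// context small enough for a phone.
final class LocalLLM {
    private var model: OpaquePointer?
    private var context: OpaquePointer?

    init?(ggufPath: String) {
        llama_backend_init()

        var modelParams = llama_model_default_params()
        // Offload layers to the GPU (Metal on iOS) where available.
        modelParams.n_gpu_layers = 99

        guard let model = llama_model_load_from_file(ggufPath, modelParams) else {
            return nil
        }
        self.model = model

        var ctxParams = llama_context_default_params()
        ctxParams.n_ctx = 2048      // modest context window to limit RAM use
        ctxParams.n_threads = 4     // tune to the device's performance cores

        self.context = llama_init_from_model(model, ctxParams)
    }

    deinit {
        if let context { llama_free(context) }
        if let model { llama_model_free(model) }
        llama_backend_free()
    }
}
```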

Impact

Implementing this would benefit users who cannot always access the internet on their mobile devices, as well as those who want the privacy of a local LLM on the go.

Additional Context

Inspired by the addition of a local inference option to the desktop version of AppFlowy.


@m13v commented on GitHub (Mar 18, 2026):

running llama.cpp on mobile is totally viable now, especially with Q4_K_M quantization on newer phones. we went through a similar evaluation for our macOS agent and the main consideration was memory pressure - on devices with 6GB RAM you need to be careful about model loading/unloading or the OS will kill your app. also worth looking at CoreML conversion for Apple devices since the Neural Engine is significantly faster than CPU inference for supported architectures.
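To make the memory-pressure point concrete, here is an illustrative sketch of one way to handle it on iOS, using Dispatch's memory-pressure source (a real API) to drop the weights before the OS jetsams the app; `LocalLLM` is the hypothetical wrapper sketched above:

```swift
import Foundation

/// Illustrative only: unload the model under memory pressure and lazily
/// reload it on the next request. Freeing multi-GB weights beats being
/// killed by the OS.
final class ModelLifecycleManager {
    private var llm: LocalLLM?
    private let ggufPath: String
    private let pressureSource: DispatchSourceMemoryPressure

    init(ggufPath: String) {
        self.ggufPath = ggufPath
        self.pressureSource = DispatchSource.makeMemoryPressureSource(
            eventMask: [.warning, .critical], queue: .main)
        pressureSource.setEventHandler { [weak self] in
            // Drop the weights immediately on warning/critical pressure.
            self?.llm = nil
        }
        pressureSource.resume()
    }

    /// Reload on demand; callers must tolerate a slow first call after
    /// an unload.
    func acquire() -> LocalLLM? {
        if llm == nil { llm = LocalLLM(ggufPath: ggufPath) }
        return llm
    }
}
```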


@m13v commented on GitHub (Mar 18, 2026):

for reference, here's how we handle the provider layer that can switch between local and cloud inference: https://github.com/m13v/fazm/blob/main/Desktop/Sources/Providers/ChatProvider.swift

and the transcription service that uses on-device WhisperKit for speech-to-text: https://github.com/m13v/fazm/blob/main/Desktop/Sources/TranscriptionService.swift
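Since the linked files may move, here is a hypothetical sketch of the general pattern the provider layer describes: one protocol with a local and a cloud implementation, selected by connectivity or a user preference. All names below are made up for illustration, and `ModelLifecycleManager` comes from the sketch above:

```swift
import Foundation

/// One protocol, two backends, chosen at call-site construction time.
protocol ChatProvider {
    func complete(prompt: String) async throws -> String
}

struct CloudProvider: ChatProvider {
    let endpoint: URL  // hypothetical cloud completion endpoint

    func complete(prompt: String) async throws -> String {
        var request = URLRequest(url: endpoint)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(["prompt": prompt])
        let (data, _) = try await URLSession.shared.data(for: request)
        return String(decoding: data, as: UTF8.self)
    }
}

struct LocalProvider: ChatProvider {
    let manager: ModelLifecycleManager  // from the memory-pressure sketch

    func complete(prompt: String) async throws -> String {
        guard manager.acquire() != nil else {
            throw CocoaError(.fileReadUnknown)  // model failed to load
        }
        // Placeholder: a real implementation runs llama.cpp's
        // tokenize -> decode -> sample loop against the loaded context.
        return "(local completion of: \(prompt))"
    }
}

/// Prefer local inference when offline or when the user opted into
/// on-device privacy.
func makeProvider(offline: Bool, preferLocal: Bool,
                  local: LocalProvider, cloud: CloudProvider) -> any ChatProvider {
    if offline || preferLocal { return local }
    return cloud
}
```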
