Offline AI: Deploying LLMs on Android Devices
One of the largest barriers to equitable education technology is reliable internet access. When deploying the PBot AI Tutor to rural Malaysian schools, we recognized that a purely cloud-based API architecture would leave many students behind.
To solve this, we brought the AI directly onto the device.
The Challenge
Running Large Language Models (LLMs) on mobile devices means working within tight compute and memory budgets: the model's weights, activations, and KV cache must all fit alongside the rest of the system in a few gigabytes of RAM, on hardware with no dedicated GPU guarantees.
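To make the constraint concrete, here is a quick back-of-the-envelope sketch (assuming a 1-billion-parameter model; weight storage only, ignoring activations and KV cache) showing why quantization is essential on-device:

```kotlin
// Rough weight-storage footprint for a model at different numeric precisions.
// Weights only: activations and the KV cache add further runtime memory.
fun weightBytes(params: Long, bitsPerParam: Int): Long = params * bitsPerParam / 8

fun main() {
    val params = 1_000_000_000L // ~1B parameters, as in Gemma 3 1B
    val gib = 1024.0 * 1024.0 * 1024.0
    for ((label, bits) in listOf("fp32" to 32, "fp16" to 16, "int8" to 8, "int4" to 4)) {
        println("%s: %.2f GiB".format(label, weightBytes(params, bits) / gib))
    }
}
```

At full fp32 precision the weights alone approach 4 GB, which already exceeds the free RAM on many budget Android phones; 4-bit quantization brings that down by 8x.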
The Solution
We selected Google's Gemma 3 1B as the base model. At roughly one billion parameters, it is small enough to run on commodity phones, especially once quantized. We used Unsloth with LoRA to fine-tune the model off-device on curated, multi-turn Malaysian-curriculum QA datasets, achieving strong performance in both Bahasa Melayu and English without bloating the model weights: LoRA trains only small low-rank adapter matrices, which merge back into the frozen base weights after training.
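The LoRA idea itself is simple to state: instead of updating a full weight matrix W, training learns a low-rank update B·A added to the frozen weights. A minimal Kotlin sketch of the forward pass (illustrative only; the actual fine-tuning ran in Unsloth, and all matrix shapes here are toy examples):

```kotlin
// Illustrative LoRA forward pass: y = W*x + (alpha / r) * B * (A * x).
// W is frozen (dOut x dIn); A (r x dIn) and B (dOut x r) are the small
// trainable matrices, so only r * (dIn + dOut) extra parameters are learned.
fun matVec(m: Array<DoubleArray>, v: DoubleArray): DoubleArray =
    DoubleArray(m.size) { i -> m[i].indices.sumOf { j -> m[i][j] * v[j] } }

fun loraForward(
    w: Array<DoubleArray>,  // frozen base weights
    a: Array<DoubleArray>,  // low-rank "down" projection, r rows
    b: Array<DoubleArray>,  // low-rank "up" projection, initialized to zero
    x: DoubleArray,
    alpha: Double,          // LoRA scaling hyperparameter
): DoubleArray {
    val r = a.size
    val base = matVec(w, x)               // frozen path
    val update = matVec(b, matVec(a, x))  // low-rank adapter path
    return DoubleArray(base.size) { i -> base[i] + (alpha / r) * update[i] }
}
```

Because B starts at zero, the adapted model is identical to the base model at step zero, and training only ever moves the tiny A and B matrices.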
The Code
We converted the model to TFLite format and integrated it using Kotlin Multiplatform. Here is a simplified snippet of our local inference wrapper:
import android.content.Context
import org.tensorflow.lite.Interpreter

class LocalAITutor {

    private var interpreter: Interpreter? = null

    fun initialize(context: Context, modelName: String) {
        val options = Interpreter.Options().apply {
            setNumThreads(4)
            setUseNNAPI(true) // Hardware acceleration via the NNAPI delegate
        }
        // loadModelFile memory-maps the .tflite file from assets (omitted for brevity)
        val modelBuffer = loadModelFile(context, modelName)
        interpreter = Interpreter(modelBuffer, options)
    }

    fun generateResponse(prompt: String): String {
        // tokenize/decode wrap the tokenizer (omitted for brevity)
        val inputIds = tokenize(prompt)
        val outputBuffer = FloatArray(MAX_TOKENS)
        interpreter?.run(inputIds, outputBuffer)
            ?: error("initialize() must be called before generateResponse()")
        return decode(outputBuffer)
    }

    companion object {
        private const val MAX_TOKENS = 512 // output buffer size, tuned per model
    }
}
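The `tokenize` and `decode` helpers are omitted above. As an illustration of the final step, a minimal greedy `decode` takes the model's per-position scores and picks the highest-scoring vocabulary entry at each position (the toy vocabulary below is a hypothetical stand-in; the real deployment uses Gemma's own tokenizer):

```kotlin
// Greedy decoding sketch: for each position, take the argmax over the
// vocabulary and map it back to a token string. Assumes `logits` is laid
// out as [position][vocabEntry]; a real tokenizer replaces the toy vocab.
fun greedyDecode(logits: Array<FloatArray>, vocab: List<String>): String =
    logits.joinToString("") { row ->
        val best = row.indices.maxByOrNull { row[it] } ?: 0
        vocab[best]
    }
```

Production decoding also needs sampling, stop tokens, and subword merging, but the argmax loop is the core of the simplest strategy.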
References and Resources
To replicate or learn more about this approach:
- Unsloth AI GitHub Repository - a fast, open-source fine-tuning library.
- Google Gemma Release Notes
- Kotlin Multiplatform Mobile (KMM)
It's clear that on-device AI is the next frontier, providing privacy, low latency, and offline equity.