other backends such as whisper.cpp? #33

Open · BBC-Esq opened this issue Feb 22, 2024 · 7 comments


BBC-Esq commented Feb 22, 2024

If you're collecting backends, I'd be very interested in seeing whisper.cpp as a possible backend. Here are some links:

https://github.com/ggerganov/whisper.cpp
https://github.com/abdeladim-s/pywhispercpp
https://github.com/tigros/Whisperer
https://github.com/Const-me/Whisper
https://github.com/aarnphm/whispercpp

shashikg (Owner) commented Mar 1, 2024

I think whisper.cpp does not support batching. Do you know of any community implementation for batched whisper.cpp?


BBC-Esq commented Mar 1, 2024

The "tigros" link i gave you above, the guy names it "batch" but I'm not sure if it's "batch" in the same sense as you mean the word in a technical sense...


BBC-Esq commented Mar 1, 2024

And apparently he uses the "Const-me/Whisper" repository's approach and just creates multiple instances, which might be technically different from what you're referring to?

However, correct me if I'm wrong, but weren't you the one to implement batch processing with CTranslate2? I don't know of anyone else who did it before you, and it's something I'd been looking for for a while. WhisperX kind of did it, I guess... I know there was some discussion on faster-whisper about it, but I didn't think he actually did it.

That's why I thought you could implement batch processing in whisper.cpp if it didn't already exist?
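To make the distinction concrete, here's a rough conceptual sketch in Python; the model objects and the transcribe_batch method are hypothetical stand-ins, not an API from any of the repos linked above:

```python
# Conceptual sketch only: the model objects and transcribe()/transcribe_batch()
# are hypothetical stand-ins, not taken from any repository linked above.
from concurrent.futures import ThreadPoolExecutor

def multiple_instances(models, audio_files):
    # "Multiple instances" approach: N independent engines, each handling one
    # file. Throughput comes from running engines side by side, not from the
    # network seeing more than one sample per forward pass.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return list(pool.map(lambda pair: pair[0].transcribe(pair[1]),
                             zip(models, audio_files)))

def true_batching(model, audio_segments):
    # "True batching" approach: one engine receives all segments stacked
    # together and runs them through the network in a single batched pass.
    return model.transcribe_batch(audio_segments)
```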


BBC-Esq commented Mar 1, 2024

Upon further research, am I correct in understanding that you're referring to batch processing capabilities like this method within the CTranslate2 library:

Whisper::generate(const StorageView& features,
                      std::vector<std::vector<std::string>> prompts,
                      WhisperOptions options) {
      const size_t batch_size = features.dim(0);
      return post_batch<WhisperGenerationResult>(
        [features = features.sync_copy(),
         prompts = std::move(prompts),
         options = std::move(options)]
        (WhisperReplica& replica) mutable {
          return replica.generate(std::move(features), prompts, options);
        },
        batch_size);
    }

I believe your program primarily harnesses CTranslate2's inherent batch processing capabilities in this manner, in contrast to the faster-whisper library. Basically, WhisperS2T sends a batched array whereas faster-whisper doesn't?

And you're wondering if whisper.cpp has something similar?
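If it helps make that concrete, here is a minimal sketch of "sending an array" through CTranslate2's Python API; the model path is hypothetical and the prompt tokens are only illustrative, so check the CTranslate2 docs before relying on it:

```python
import numpy as np
import ctranslate2

# Assumes a Whisper model already converted to CTranslate2 format at this path.
model = ctranslate2.models.Whisper("whisper-small-ct2", device="cpu")

# Stack several 30-second log-mel segments into one array: (batch, n_mels, n_frames).
features = ctranslate2.StorageView.from_array(
    np.zeros((4, 80, 3000), dtype=np.float32))

# One prompt per item in the batch; the whole batch is dispatched in a single call.
prompt = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
results = model.generate(features, [prompt] * 4)
```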


BBC-Esq commented Mar 4, 2024

I did some further research on the whisper.cpp library, and here's what I found within the source file itself, whisper.cpp:

https://github.com/ggerganov/whisper.cpp/blob/master/whisper.cpp

I "believe" that it allows for batch processing in this snippet. I will first provide you with the portions from the "cpp" repository that pertain to batch processing...then, if I'm able, I'll locate any python bindings for whisper.cpp that implement the batch processing feature...keep in mind that the python bindings I've found don't stay up to date as often as the cpp repository...Here goes:

MULTIPLE EXAMPLES OF BATCH REFERENCES FROM WHISPER.CPP
struct whisper_batch {
    int32_t n_tokens;

    whisper_token  *  token;
    whisper_pos    *  pos;
    int32_t        *  n_seq_id;
    whisper_seq_id ** seq_id;   // null terminated
    int8_t         *  logits;
};

static struct whisper_batch whisper_batch_init(int32_t n_tokens, int32_t n_seq_max) {
    whisper_batch batch = { 0, nullptr, nullptr, nullptr, nullptr, nullptr, };

    batch.token    = (whisper_token *  ) malloc(sizeof(whisper_token)    * (n_tokens));
    batch.pos      = (whisper_pos *)     malloc(sizeof(whisper_pos)      * (n_tokens));
    batch.n_seq_id = (int32_t *)         malloc(sizeof(int32_t)          * (n_tokens));
    batch.seq_id   = (whisper_seq_id **) malloc(sizeof(whisper_seq_id *) * (n_tokens + 1));
    for (int i = 0; i < n_tokens; ++i) {
        batch.seq_id[i] = (whisper_seq_id *) malloc(sizeof(whisper_seq_id)   * n_seq_max);
    }
    batch.seq_id[n_tokens] = nullptr;
    batch.logits   = (int8_t *)          malloc(sizeof(int8_t)           * n_tokens);

    return batch;
}

static void whisper_batch_free(struct whisper_batch batch) {
    if (batch.token)    free(batch.token);
    if (batch.pos)      free(batch.pos);
    if (batch.n_seq_id) free(batch.n_seq_id);
    if (batch.seq_id) {
        for (int i = 0; batch.seq_id[i]; ++i) {
            free(batch.seq_id[i]);
        }
        free(batch.seq_id);
    }
    if (batch.logits)   free(batch.logits);
}

static void whisper_batch_prep_legacy(whisper_batch & batch, const whisper_token * tokens, int n_tokens, int n_past, int seq_id) {
    batch.n_tokens = n_tokens;
    for (int i = 0; i < n_tokens; ++i) {
        if (tokens) {
            batch.token[i] = tokens[i];
        }
        batch.pos     [i]    = n_past + i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = seq_id;
        batch.logits  [i]    = 0;
    }
    batch.logits[n_tokens - 1] = 1;
}

static struct ggml_cgraph * whisper_build_graph_decoder(
whisper_context & wctx,
whisper_state & wstate,
const whisper_batch & batch,
bool worst_case) {
const auto & model = wctx.model;
const auto & hparams = model.hparams;

auto & kv_self = wstate.kv_self;

WHISPER_ASSERT(!!kv_self.ctx);

const int n_ctx   = kv_self.size;
const int n_state = hparams.n_text_state;
const int n_head  = hparams.n_text_head;
const int n_layer = hparams.n_text_layer;

const int n_tokens    = batch.n_tokens;
const int n_audio_ctx = wstate.exp_n_audio_ctx > 0 ? wstate.exp_n_audio_ctx : hparams.n_audio_ctx;

const int32_t n_kv     = worst_case ? n_ctx            : kv_self.n;
const int32_t kv_head  = worst_case ? n_ctx - n_tokens : kv_self.head;

//WHISPER_LOG_DEBUG("%s: n_past = %d, n_tokens = %d, n_audio_ctx = %d, n_ctx = %d\n", __func__, n_past, n_tokens, n_audio_ctx, n_ctx);

struct ggml_init_params params = {
    /*.mem_size   =*/ wstate.alloc_decode.meta.size(),
    /*.mem_buffer =*/ wstate.alloc_decode.meta.data(),
    /*.no_alloc   =*/ true,
};

struct ggml_context * ctx0 = ggml_init(params);

ggml_cgraph * gf = ggml_new_graph_custom(ctx0, WHISPER_MAX_NODES, false);

struct ggml_tensor * embd = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
ggml_set_name(embd, "embd");
ggml_set_input(embd);

struct ggml_tensor * position = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
ggml_set_name(position, "position");
ggml_set_input(position);

const float KQscale = pow(float(n_state)/n_head, -0.25);

struct ggml_tensor * KQ_mask = ggml_new_tensor_3d(ctx0, GGML_TYPE_F32, n_kv, n_tokens, 1);
ggml_set_name(KQ_mask, "KQ_mask");
ggml_set_input(KQ_mask);

// token encoding + position encoding
struct ggml_tensor * cur =
    ggml_add(ctx0,
            ggml_get_rows(ctx0, model.d_te, embd),
            ggml_get_rows(ctx0, model.d_pe, position));

struct ggml_tensor * inpL = cur;

for (int il = 0; il < n_layer; ++il) {
    const auto & layer = model.layers_decoder[il];

    // norm
    {
        cur = ggml_norm(ctx0, inpL, hparams.eps);

        // cur = ln_0_w*cur + ln_0_b
        cur = ggml_add(ctx0,
                ggml_mul(ctx0,
                    cur,
                    layer.attn_ln_0_w),
                layer.attn_ln_0_b);
    }

    // self-attention
    {
        struct ggml_tensor * Qcur = ggml_mul_mat(ctx0,
                layer.attn_q_w,
                cur);

        Qcur = ggml_add(ctx0,
                    Qcur,
                    layer.attn_q_b);

        Qcur = ggml_scale(ctx0, Qcur, KQscale);

        // note: no bias for Key
        struct ggml_tensor * Kcur = ggml_mul_mat(ctx0,
                layer.attn_k_w,
                cur);

        Kcur = ggml_scale(ctx0, Kcur, KQscale);

        // store key and value to memory
        {
            struct ggml_tensor * Vcur = ggml_mul_mat(ctx0,
                    layer.attn_v_w,
                    cur);

            Vcur = ggml_add(ctx0,
                        Vcur,
                        layer.attn_v_b);

            Vcur = ggml_transpose(ctx0, ggml_reshape_2d(ctx0, Vcur, n_state, n_tokens));

            struct ggml_tensor * k = ggml_view_1d(ctx0, kv_self.k, n_tokens*n_state, (ggml_element_size(kv_self.k)*n_state)*(il*n_ctx + kv_head));
            struct ggml_tensor * v = ggml_view_2d(ctx0, kv_self.v, n_tokens, n_state,
                    (   n_ctx)*ggml_element_size(kv_self.v),
                    (il*n_ctx)*ggml_element_size(kv_self.v)*n_state + kv_head*ggml_element_size(kv_self.v));

            ggml_build_forward_expand(gf, ggml_cpy(ctx0, Kcur, k));
            ggml_build_forward_expand(gf, ggml_cpy(ctx0, Vcur, v));
        }

        // ------

        struct ggml_tensor * Q =
            ggml_permute(ctx0,
                    ggml_reshape_3d(ctx0, Qcur, n_state/n_head, n_head, n_tokens),
                    0, 2, 1, 3);

        struct ggml_tensor * K =
            ggml_view_3d(ctx0, kv_self.k,
                    n_state/n_head, n_kv, n_head,
                    ggml_element_size(kv_self.k)*n_state,
                    ggml_element_size(kv_self.k)*n_state/n_head,
                    ggml_element_size(kv_self.k)*n_state*n_ctx*il);

        // K * Q
        struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);

        //struct ggml_tensor * KQ_scaled = ggml_scale(ctx0, KQ, KQ_scale);

        //struct ggml_tensor * KQ_masked = ggml_diag_mask_inf(ctx0, KQ, n_past);
        struct ggml_tensor * KQ_masked = ggml_add(ctx0, KQ, KQ_mask);

        struct ggml_tensor * KQ_soft_max = ggml_soft_max(ctx0, KQ_masked);

        struct ggml_tensor * V =
            ggml_view_3d(ctx0, kv_self.v,
                    n_kv, n_state/n_head, n_head,
                    n_ctx*ggml_element_size(kv_self.v),
                    n_ctx*ggml_element_size(kv_self.v)*n_state/n_head,
                    n_ctx*ggml_element_size(kv_self.v)*n_state*il);

        struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ_soft_max);

        struct ggml_tensor * KQV_merged = ggml_permute(ctx0, KQV, 0, 2, 1, 3);

        cur = ggml_cpy(ctx0,
                KQV_merged,
                ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_state, n_tokens));
    }

    // projection
    {
        cur = ggml_mul_mat(ctx0,
                layer.attn_ln_1_w,
                cur);

        cur = ggml_add(ctx0,
                cur,
                layer.attn_ln_1_b);
    }

    // add the input
    struct ggml_tensor * inpCA = ggml_add(ctx0, cur, inpL);

    // norm
    {
        cur = ggml_norm(ctx0, inpCA, hparams.eps); // note: we use inpCA here

        // cur = ln_0_w*cur + ln_0_b
        cur = ggml_add(ctx0,
                ggml_mul(ctx0,
                    cur,
                    layer.cross_attn_ln_0_w),
                layer.cross_attn_ln_0_b);
    }

    // cross-attention
    {
        struct ggml_tensor * Qcur = ggml_mul_mat(ctx0,
                layer.cross_attn_q_w,
                cur);

        Qcur = ggml_add(ctx0,
                    Qcur,
                    layer.cross_attn_q_b);

        Qcur = ggml_scale(ctx0, Qcur, KQscale);

        // Kcross is already scaled
        struct ggml_tensor * Kcross =
            ggml_view_3d(ctx0, wstate.kv_cross.k,
                    n_state/n_head, n_audio_ctx, n_head,
                    ggml_element_size(wstate.kv_cross.k)*n_state,
                    ggml_element_size(wstate.kv_cross.k)*n_state/n_head,
                    ggml_element_size(wstate.kv_cross.k)*n_state*n_audio_ctx*il);

        //struct ggml_tensor * Vcross =
        //    ggml_reshape_3d(ctx0,
        //            ggml_view_1d(ctx0, wstate.kv_cross.v, n_audio_ctx*n_state, il*n_audio_ctx*ggml_element_size(wstate.kv_cross.v)*n_state),
        //            n_state/n_head, n_head, n_audio_ctx);

        //struct ggml_tensor * V_trans =
        //    ggml_cpy(ctx0,
        //            ggml_permute(ctx0, Vcross, 1, 2, 0, 3),
        //            ggml_new_tensor_3d(ctx0, Vcross->type, n_audio_ctx, n_state/n_head, n_head));

        struct ggml_tensor * V =
            ggml_view_3d(ctx0, wstate.kv_cross.v,
                    n_audio_ctx, n_state/n_head, n_head,
                    n_audio_ctx*ggml_element_size(wstate.kv_cross.v),
                    n_audio_ctx*ggml_element_size(wstate.kv_cross.v)*n_state/n_head,
                    n_audio_ctx*ggml_element_size(wstate.kv_cross.v)*n_state*il);

        // ------

        struct ggml_tensor * Q =
            ggml_permute(ctx0,
                    ggml_reshape_3d(ctx0, Qcur, n_state/n_head, n_head, n_tokens),
                    0, 2, 1, 3);

        // K * Q
        struct ggml_tensor * KQ = ggml_mul_mat(ctx0, Kcross, Q);

        //struct ggml_tensor * KQ_scaled =
        //    ggml_scale(ctx0,
        //            KQ,
        //            ggml_new_f32(ctx0, 1.0f/sqrt(float(n_state)/n_head))
        //            );

        // no masking for cross-attention
        //struct ggml_tensor * KQ_masked = ggml_diag_mask_inf(ctx0, KQ_scaled, n_past);

        struct ggml_tensor * KQ_soft_max = ggml_soft_max(ctx0, KQ);

        struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ_soft_max);

        struct ggml_tensor * KQV_merged = ggml_permute(ctx0, KQV, 0, 2, 1, 3);

        // cur = KQV_merged.contiguous().view(n_state, n_tokens)
        cur = ggml_cpy(ctx0,
                KQV_merged,
                ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_state, n_tokens));
    }

    // projection
    {
        cur = ggml_mul_mat(ctx0,
                layer.cross_attn_ln_1_w,
                cur);

        cur = ggml_add(ctx0,
                cur,
                layer.cross_attn_ln_1_b);
    }

    // add the input
    cur = ggml_add(ctx0, cur, inpCA);

    struct ggml_tensor * inpFF = cur;

    // feed-forward network
    {
        // norm
        {
            cur = ggml_norm(ctx0, inpFF, hparams.eps);

            // cur = mlp_ln_w*cur + mlp_ln_b
            cur = ggml_add(ctx0,
                    ggml_mul(ctx0,
                        cur,
                        layer.mlp_ln_w),
                    layer.mlp_ln_b);
        }

        // fully connected
        cur = ggml_mul_mat(ctx0,
                layer.mlp_0_w,
                cur);

        cur = ggml_add(ctx0,
                cur,
                layer.mlp_0_b);

        // GELU activation
        cur = ggml_gelu(ctx0, cur);

        // projection
        cur = ggml_mul_mat(ctx0,
                layer.mlp_1_w,
                cur);

        cur = ggml_add(ctx0,
                cur,
                layer.mlp_1_b);
    }

    inpL = ggml_add(ctx0, cur, inpFF);
}

cur = inpL;

// norm
{
    cur = ggml_norm(ctx0, cur, hparams.eps);

    cur = ggml_add(ctx0,
            ggml_mul(ctx0,
                cur,
                model.d_ln_w),
            model.d_ln_b);
}

// compute logits only for the last token
// comment this line to compute logits for all n_tokens
// might be useful in the future
//cur = ggml_view_2d(ctx0, cur, cur->ne[0], 1, cur->nb[1], (cur->ne[1] - 1)*cur->nb[1]);

struct ggml_tensor * logits = ggml_mul_mat(ctx0, model.d_te, cur);

ggml_build_forward_expand(gf, logits);

ggml_free(ctx0);

return gf;

}

// evaluate the decoder
//
// given text prompt + audio features -> computes the logits for the next token
//
// - model: the model
// - n_threads: number of threads to use
// - tokens: text prompt
// - n_tokens: number of tokens in the prompt
// - n_past: number of past tokens to prefix the prompt with
//
static bool whisper_decode_internal(
whisper_context & wctx,
whisper_state & wstate,
const whisper_batch & batch,
const int n_threads,
ggml_abort_callback abort_callback,
void * abort_callback_data) {
const int64_t t_start_us = ggml_time_us();

const auto & model   = wctx.model;
const auto & hparams = model.hparams;

const int n_vocab  = hparams.n_vocab;
const int n_tokens = batch.n_tokens;

auto & logits_out = wstate.logits;

struct ggml_tensor * logits;

// find KV slot for the batch
{
    auto & kv_self = wstate.kv_self;

    if (!whisper_kv_cache_find_slot(kv_self, batch)) {
        return false;
    }

    kv_self.n = whisper_kv_cache_cell_max(kv_self);
    //kv_self.n = std::min((int32_t) hparams.n_text_ctx, std::max(32, whisper_kv_cache_cell_max(kv_self)));
    //printf("n_tokens = %5d, kv_self.head = %5d, kv_self.n = %5d, seq_id = %5d\n", batch.n_tokens, kv_self.head, kv_self.n, batch.seq_id[0][0]);
}

// decoder
{
    auto & alloc = wstate.alloc_decode.alloc;

    ggml_cgraph * gf = whisper_build_graph_decoder(wctx, wstate, batch, false);

    if (!ggml_gallocr_alloc_graph(alloc, gf)) {
        // should never happen as we pre-allocate the memory
        return false;
    }

    // set the inputs
    {
        struct ggml_tensor * embd = ggml_graph_get_tensor(gf, "embd");
        ggml_backend_tensor_set(embd, batch.token, 0, n_tokens*ggml_element_size(embd));
    }

    {
        struct ggml_tensor * position = ggml_graph_get_tensor(gf, "position");
        for (int i = 0; i < n_tokens; ++i) {
            const int32_t val = batch.pos[i];
            ggml_backend_tensor_set(position, &val, i*sizeof(int32_t), sizeof(int32_t));
        }
    }

    {
        struct ggml_tensor * KQ_mask = ggml_graph_get_tensor(gf, "KQ_mask");

        auto & kv_self = wstate.kv_self;
        const int32_t n_kv     = kv_self.n;

        wstate.inp_mask.resize(n_kv*n_tokens);

        float * data = wstate.inp_mask.data();
        memset(data, 0, ggml_nbytes(KQ_mask));

        for (int h = 0; h < 1; ++h) {
            for (int j = 0; j < n_tokens; ++j) {
                const whisper_pos    pos    = batch.pos[j];
                const whisper_seq_id seq_id = batch.seq_id[j][0];

                for (int i = 0; i < n_kv; ++i) {
                    if (!kv_self.cells[i].has_seq_id(seq_id) || kv_self.cells[i].pos > pos) {
                        data[h*(n_kv*n_tokens) + j*n_kv + i] = -INFINITY;
                    }
                }
            }
        }

        ggml_backend_tensor_set(KQ_mask, wstate.inp_mask.data(), 0, ggml_nelements(KQ_mask)*sizeof(float));
    }

    logits = gf->nodes[gf->n_nodes - 1];

    if (!ggml_graph_compute_helper(wstate.backend, gf, n_threads)) {
        return false;
    }
}

logits_out.resize(n_tokens*n_vocab);
for (int i = 0; i < n_tokens; i++) {
    if (batch.logits[i] == 0) {
        continue;
    }
    ggml_backend_tensor_get(logits, logits_out.data() + (n_vocab*i), sizeof(float)*(n_vocab*i), sizeof(float)*n_vocab);
}

if (batch.n_tokens > 1) {
    //printf("%s: used_mem = %f MB, %f MB, %f MB %f MB %f MB\n", __func__,
    //        ggml_used_mem(ctx0)/1e6,
    //        wstate.get_buf_max_mem(0)/1e6,
    //        wstate.get_buf_max_mem(1)/1e6,
    //        wstate.get_buf_max_mem(2)/1e6,
    //        wstate.get_buf_max_mem(3)/1e6);
}

if (batch.n_tokens == 1) {
    wstate.t_decode_us += ggml_time_us() - t_start_us;
    wstate.n_decode++;
} else if (batch.n_tokens < 16) {
    wstate.t_batchd_us += ggml_time_us() - t_start_us;
    wstate.n_batchd += n_tokens;
} else {
    wstate.t_prompt_us += ggml_time_us() - t_start_us;
    wstate.n_prompt += n_tokens;
}

return !(abort_callback && abort_callback(abort_callback_data));

}


int whisper_decode_with_state(struct whisper_context * ctx, struct whisper_state * state, const whisper_token * tokens, int n_tokens, int n_past, int n_threads) {
whisper_batch_prep_legacy(state->batch, tokens, n_tokens, n_past, 0);

whisper_kv_cache_seq_rm(state->kv_self, 0, n_past, -1);

if (!whisper_decode_internal(*ctx, *state, state->batch, n_threads, nullptr, nullptr)) {
    WHISPER_LOG_ERROR("%s: failed to eval\n", __func__);
    return 1;
}

return 0;

}

int whisper_decode(struct whisper_context * ctx, const whisper_token * tokens, int n_tokens, int n_past, int n_threads) {
if (ctx->state == nullptr) {
WHISPER_LOG_ERROR("%s: ERROR state was not loaded.\n", func);
return -1;
}

return whisper_decode_with_state(ctx, ctx->state, tokens, n_tokens, n_past, n_threads);

}

int whisper_tokenize(struct whisper_context * ctx, const char * text, whisper_token * tokens, int n_max_tokens) {
const auto res = tokenize(ctx->vocab, text);

if (n_max_tokens < (int) res.size()) {
    WHISPER_LOG_ERROR("%s: too many resulting tokens: %d (max %d)\n", __func__, (int) res.size(), n_max_tokens);
    return -1;
}

for (int i = 0; i < (int) res.size(); i++) {
    tokens[i] = res[i];
}

return res.size();

}

int whisper_lang_max_id() {
auto max_id = 0;
for (const auto & kv : g_lang) {
max_id = std::max(max_id, kv.second.first);
}

return max_id;

}

int whisper_lang_id(const char * lang) {
if (!g_lang.count(lang)) {
for (const auto & kv : g_lang) {
if (kv.second.second == lang) {
return kv.second.first;
}
}

    WHISPER_LOG_ERROR("%s: unknown language '%s'\n", __func__, lang);
    return -1;
}
return g_lang.at(lang).first;

}

const char * whisper_lang_str(int id) {
for (const auto & kv : g_lang) {
if (kv.second.first == id) {
return kv.first.c_str();
}
}

WHISPER_LOG_ERROR("%s: unknown language id %d\n", __func__, id);
return nullptr;

}

const char * whisper_lang_str_full(int id) {
for (const auto & kv : g_lang) {
if (kv.second.first == id) {
return kv.second.second.c_str();
}
}

WHISPER_LOG_ERROR("%s: unknown language id %d\n", __func__, id);
return nullptr;

}

int whisper_lang_auto_detect_with_state(
struct whisper_context * ctx,
struct whisper_state * state,
int offset_ms,
int n_threads,
float * lang_probs) {
const int seek = offset_ms/10;

if (seek < 0) {
    WHISPER_LOG_ERROR("%s: offset %dms is before the start of the audio\n", __func__, offset_ms);
    return -1;
}

if (seek >= state->mel.n_len_org) {
    WHISPER_LOG_ERROR("%s: offset %dms is past the end of the audio (%dms)\n", __func__, offset_ms, state->mel.n_len_org*10);
    return -2;
}

// run the encoder
if (whisper_encode_with_state(ctx, state, seek, n_threads) != 0) {
    WHISPER_LOG_ERROR("%s: failed to encode\n", __func__);
    return -6;
}

const std::vector<whisper_token> prompt = { whisper_token_sot(ctx) };

if (whisper_decode_with_state(ctx, state, prompt.data(), prompt.size(), 0, n_threads) != 0) {
    WHISPER_LOG_ERROR("%s: failed to decode\n", __func__);
    return -7;
}

auto & logits_id = state->decoders[0].logits_id;
logits_id.clear();

for (const auto & kv : g_lang) {
    const auto token_lang = whisper_token_lang(ctx, kv.second.first);
    logits_id.emplace_back(state->logits[token_lang], kv.second.first);
}

// sort descending
{
    using pair_type = std::remove_reference<decltype(logits_id)>::type::value_type;
    std::sort(logits_id.begin(), logits_id.end(), [](const pair_type & a, const pair_type & b) {
        return a.first > b.first;
    });
}

// softmax
{
    const auto max = logits_id[0].first;

    double sum = 0.0f;
    for (auto & kv : logits_id) {
        kv.first = exp(kv.first - max);
        sum += kv.first;
    }

    for (auto & kv : logits_id) {
        kv.first /= sum;
    }
}

{
    for (const auto & prob : logits_id) {
        if (lang_probs) {
            lang_probs[prob.second] = prob.first;
        }

        //printf("%s: lang %2d (%3s): %f\n", __func__, prob.second, whisper_lang_str(prob.second), prob.first);
    }
}

return logits_id[0].second;

}

int whisper_lang_auto_detect(
struct whisper_context * ctx,
int offset_ms,
int n_threads,
float * lang_probs) {
return whisper_lang_auto_detect_with_state(ctx, ctx->state, offset_ms, n_threads, lang_probs);
}

int whisper_model_n_vocab(struct whisper_context * ctx) {
return ctx->model.hparams.n_vocab;
}

int whisper_model_n_audio_ctx(struct whisper_context * ctx) {
return ctx->model.hparams.n_audio_ctx;
}

int whisper_model_n_audio_state(struct whisper_context * ctx) {
return ctx->model.hparams.n_audio_state;
}

int whisper_model_n_audio_head(struct whisper_context * ctx) {
return ctx->model.hparams.n_audio_head;
}

int whisper_model_n_audio_layer(struct whisper_context * ctx) {
return ctx->model.hparams.n_audio_layer;
}

int whisper_model_n_text_ctx(struct whisper_context * ctx) {
return ctx->model.hparams.n_text_ctx;
}

int whisper_model_n_text_state(struct whisper_context * ctx) {
return ctx->model.hparams.n_text_state;
}

int whisper_model_n_text_head(struct whisper_context * ctx) {
return ctx->model.hparams.n_text_head;
}

int whisper_model_n_text_layer(struct whisper_context * ctx) {
return ctx->model.hparams.n_text_layer;
}

int whisper_model_n_mels(struct whisper_context * ctx) {
return ctx->model.hparams.n_mels;
}

int whisper_model_ftype(struct whisper_context * ctx) {
return ctx->model.hparams.ftype;
}

int whisper_model_type(struct whisper_context * ctx) {
return ctx->model.type;
}

const char *whisper_model_type_readable(struct whisper_context * ctx) {
switch (ctx->model.type) {
case e_model::MODEL_TINY:
return "tiny";
case e_model::MODEL_BASE:
return "base";
case e_model::MODEL_SMALL:
return "small";
case e_model::MODEL_MEDIUM:
return "medium";
case e_model::MODEL_LARGE:
return "large";
default:
return "unknown";
}
}

int whisper_n_len_from_state(struct whisper_state * state) {
return state->mel.n_len_org;
}

int whisper_n_len(struct whisper_context * ctx) {
return ctx->state->mel.n_len_org;
}

int whisper_n_vocab(struct whisper_context * ctx) {
return ctx->vocab.n_vocab;
}

int whisper_n_text_ctx(struct whisper_context * ctx) {
return ctx->model.hparams.n_text_ctx;
}

int whisper_n_audio_ctx(struct whisper_context * ctx) {
return ctx->model.hparams.n_audio_ctx;
}

int whisper_is_multilingual(struct whisper_context * ctx) {
return ctx->vocab.is_multilingual() ? 1 : 0;
}

float * whisper_get_logits(struct whisper_context * ctx) {
return ctx->state->logits.data();
}

float * whisper_get_logits_from_state(struct whisper_state * state) {
return state->logits.data();
}

const char * whisper_token_to_str(struct whisper_context * ctx, whisper_token token) {
return ctx->vocab.id_to_token.at(token).c_str();
}

whisper_token whisper_token_eot(struct whisper_context * ctx) {
return ctx->vocab.token_eot;
}

whisper_token whisper_token_sot(struct whisper_context * ctx) {
return ctx->vocab.token_sot;
}

whisper_token whisper_token_solm(struct whisper_context * ctx) {
return ctx->vocab.token_solm;
}

whisper_token whisper_token_prev(struct whisper_context * ctx) {
return ctx->vocab.token_prev;
}

whisper_token whisper_token_nosp(struct whisper_context * ctx) {
return ctx->vocab.token_nosp;
}

whisper_token whisper_token_not(struct whisper_context * ctx) {
return ctx->vocab.token_not;
}

whisper_token whisper_token_beg(struct whisper_context * ctx) {
return ctx->vocab.token_beg;
}

whisper_token whisper_token_lang(struct whisper_context * ctx, int lang_id) {
return whisper_token_sot(ctx) + 1 + lang_id;
}

whisper_token whisper_token_translate(struct whisper_context * ctx) {
return ctx->vocab.token_translate;
}

whisper_token whisper_token_transcribe(struct whisper_context * ctx) {
return ctx->vocab.token_transcribe;
}

void whisper_print_timings(struct whisper_context * ctx) {
const int64_t t_end_us = ggml_time_us();

WHISPER_LOG_INFO("\n");
WHISPER_LOG_INFO("%s:     load time = %8.2f ms\n", __func__, ctx->t_load_us / 1000.0f);
if (ctx->state != nullptr) {

    const int32_t n_sample = std::max(1, ctx->state->n_sample);
    const int32_t n_encode = std::max(1, ctx->state->n_encode);
    const int32_t n_decode = std::max(1, ctx->state->n_decode);
    const int32_t n_batchd = std::max(1, ctx->state->n_batchd);
    const int32_t n_prompt = std::max(1, ctx->state->n_prompt);

    WHISPER_LOG_INFO("%s:     fallbacks = %3d p / %3d h\n", __func__, ctx->state->n_fail_p, ctx->state->n_fail_h);
    WHISPER_LOG_INFO("%s:      mel time = %8.2f ms\n", __func__, ctx->state->t_mel_us / 1000.0f);
    WHISPER_LOG_INFO("%s:   sample time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_sample_us, n_sample, 1e-3f * ctx->state->t_sample_us / n_sample);
    WHISPER_LOG_INFO("%s:   encode time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_encode_us, n_encode, 1e-3f * ctx->state->t_encode_us / n_encode);
    WHISPER_LOG_INFO("%s:   decode time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_decode_us, n_decode, 1e-3f * ctx->state->t_decode_us / n_decode);
    WHISPER_LOG_INFO("%s:   batchd time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_batchd_us, n_batchd, 1e-3f * ctx->state->t_batchd_us / n_batchd);
    WHISPER_LOG_INFO("%s:   prompt time = %8.2f ms / %5d runs (%8.2f ms per run)\n", __func__, 1e-3f * ctx->state->t_prompt_us, n_prompt, 1e-3f * ctx->state->t_prompt_us / n_prompt);
}
WHISPER_LOG_INFO("%s:    total time = %8.2f ms\n", __func__, (t_end_us - ctx->t_start_us)/1000.0f);

}

void whisper_reset_timings(struct whisper_context * ctx) {
ctx->t_start_us = ggml_time_us();
if (ctx->state != nullptr) {
ctx->state->t_mel_us = 0;
ctx->state->t_sample_us = 0;
ctx->state->t_encode_us = 0;
ctx->state->t_decode_us = 0;
ctx->state->t_batchd_us = 0;
ctx->state->t_prompt_us = 0;
ctx->state->n_sample = 0;
ctx->state->n_encode = 0;
ctx->state->n_decode = 0;
ctx->state->n_batchd = 0;
ctx->state->n_prompt = 0;
}
}



There is one more chunk, starting on line 4478 of ```whisper.cpp``` and ending on line 5898, that I felt was too long to paste here. If I can get them, I'll also paste any Python bindings for the batch functionality of ```whisper.cpp``` for your convenience. Thanks!

BBC-Esq commented Mar 4, 2024

This also seems to confirm that whisper.cpp supports batch processing like Hugging Face and CTranslate2 do...

ggerganov/whisper.cpp#1486


BBC-Esq commented Mar 4, 2024

In the announcement for version 1.5, it states that they support batching...

https://github.com/ggerganov/whisper.cpp/releases?q=batch&expanded=true

And these Python bindings claim to support whisper.cpp all the way up to version 1.5.4:

https://github.com/abdeladim-s/pywhispercpp/releases/tag/v1.2.0

Note, the above bindings are NOT listed on the whisper.cpp repo for some reason... the only two listed are the following, which haven't been updated in quite a while...

https://github.com/stlukey/whispercpp.py
https://github.com/aarnphm/whispercpp

I don't know if this means that the owner of "abdeladim-s/pywhispercpp" just hasn't requested a community integration shout-out or what... so I wouldn't necessarily assume that his bindings aren't good...
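For reference, basic (non-batched) usage of those pywhispercpp bindings looks roughly like this, going by that project's README; the model name and keyword arguments are illustrative and worth double-checking against its docs:

```python
from pywhispercpp.model import Model

# Loads a ggml model by name (downloaded automatically if missing);
# n_threads is passed through to whisper.cpp.
model = Model("base.en", n_threads=4)

segments = model.transcribe("audio.wav")
for segment in segments:
    print(segment.text)
```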
