
Re: openai-whisper_0~20230314-1_amd64.changes REJECTED



On Thu, 2023-06-22 at 19:28 +0200, Petter Reinholdtsen wrote:
> [M. Zhou]
> > I'm 100% sure the .tiktoken files are vocabulary files summarized
> > from some corpus. It's just a "word" -> "id" mapping in plain text.
> 
> Thanks.
> 
> > In order to reproduce a similar vocabulary list, I believe you can do
> > it with a wikipedia dump. But I believe GPT2 was not trained on a
> > wikipedia dump, but on a much larger corpus.
> 
> Do you have a recipe on how to create such vocabulary list?

No. One may have to go through their GPT2 paper. But anyway,
as a researcher in the AI field, I can say that OpenAI is notorious
for not disclosing any details about their latest models. They would
be better named "ClosedAI".
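
Just to illustrate what such a plain-text "word" -> "id" mapping looks
like, here is a rough sketch that builds a similar (but much cruder)
vocabulary from an already-extracted text dump. The corpus path and
vocabulary size are made up, and this is plain word counting, not the
byte-pair encoding OpenAI actually used:

# Illustration only: crude word -> id vocabulary from a plain-text
# corpus (e.g. an extracted wikipedia dump); path and size are made up.
from collections import Counter

def build_vocab(corpus_path, max_size=50000):
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    # Most frequent words get the lowest ids.
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(max_size))}

vocab = build_vocab("wikipedia_dump.txt")  # hypothetical file
for word, idx in list(vocab.items())[:5]:
    print(word, idx)

That reproduces the shape of such a vocabulary file, but not the actual
token list shipped in the .tiktoken files.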

> > That said, if openai-whisper is an inference-only package which does
> > not provide training scripts and enough details about the training
> > dataset, it should go to non-free even if the tokenizers are crystal
> > clear.
> 
> I do not know if such training scripts are present, as I do not know how
> to recognize training scripts.  If I knew how training was done, I might
> grep the source to see if the relevant keywords are present, but I do
> not know how it is done, and thus am a bit lost.

I'm a little bit biased, but if you cannot find any instructions for
training the model from scratch in their markdown/rst documentation,
then it is very likely not provided.
That should narrow down the search range for "training".
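
Something along these lines would do the narrowing; the keywords and
file extensions are only my guesses:

# Quick scan of a source tree for docs that mention training;
# keywords and extensions are guesses, adjust as needed.
import pathlib

KEYWORDS = ("train", "fine-tune", "finetune", "dataset")

for path in pathlib.Path(".").rglob("*"):
    if path.is_file() and path.suffix.lower() in (".md", ".rst"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if any(k in text for k in KEYWORDS):
            print(path)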

My expectation is that they don't provide that at all. They never
provide details about their latest models.

