-
Understanding embeddings with EmbeddingGemma
There is a lot of talk about LLMs, the so-called Large Language Models such as ChatGPT, Gemini or Llama: models that can write text, answer questions, and summarize documents. In short, models trained to generate language. Alongside this family there is another kind of model, less well known but no less important: embedding models. Unlike LLMs, they do not produce sentences; their purpose is to take a text and turn it into a sequence of numbers, a vector that represents its meaning. An LLM, too, relies internally on an embedding system in order to work. Every word, every piece of a word, is turned into numbers before it can be processed. The difference is that in LLMs this step stays hidden; it only serves as the basis for generating language. In embedding models, on the other hand, this transformation is the goal itself. To put it concretely, if we ask an LLM "what is a signature... Read all →
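As a minimal sketch of what this looks like in practice, the snippet below turns a few sentences into vectors with the sentence-transformers library. The model id `google/embeddinggemma-300m` and the example sentences are assumptions for illustration, not taken from the post.

```python
# Minimal sketch: turning text into embedding vectors with EmbeddingGemma.
# Assumes sentence-transformers is installed and that the model id
# "google/embeddinggemma-300m" is available (an assumption; check the model card).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("google/embeddinggemma-300m")

sentences = [
    "What is a digital signature?",
    "How do I sign a document electronically?",
    "Recipe for homemade bread",
]

# encode() returns one vector per sentence: the numeric representation of its meaning.
embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g. (3, 768), depending on the model's embedding size

# Semantically similar sentences end up close to each other: cosine similarity
# between the two signature-related questions should be higher than with the recipe.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```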
-
Weights manipulation
Some time ago I asked myself: do we really need days of computation and powerful GPUs to understand how an open-weights language model manages its safety mechanisms? More importantly, is there a fast and reversible way, one that does not require building abliterated models, to make a model more compliant with specific requests? That question started a piece of research that was not easy. Getting the code to work across different models took time, with many adjustments needed to fix library issues and memory limits. Once I had a working (though not entirely stable) version, I tried a different approach: modifying the embedding weights of specific tokens at runtime, gradually reducing those linked to refusal (sorry, cannot, dangerous) and increasing those linked to compliance (sure, help, explain). With some models this worked well: small changes to refusal tokens slowly weakened the safety mechanisms, with... Read all →
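The excerpt does not show the code at this point, but the mechanism it describes, scaling the embedding rows of chosen tokens in place, can be sketched roughly as below. The model name, token lists, and scaling factors are illustrative assumptions, not the author's actual setup or values.

```python
# Rough sketch of runtime token-embedding manipulation as described in the post.
# Model name, token lists and scaling factors are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the post experiments with various open-weights models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

refusal_words = ["sorry", "cannot", "dangerous"]
compliance_words = ["sure", "help", "explain"]

def token_ids(words):
    # Collect every token id produced by the words (a word may split into sub-tokens).
    ids = set()
    for w in words:
        ids.update(tokenizer.encode(w, add_special_tokens=False))
        ids.update(tokenizer.encode(" " + w, add_special_tokens=False))
    return sorted(ids)

embed = model.get_input_embeddings()          # nn.Embedding: one row per token
original = embed.weight.detach().clone()      # keep a copy so the change stays reversible

with torch.no_grad():
    for i in token_ids(refusal_words):
        embed.weight[i] *= 0.8   # step-by-step reduction of refusal-linked rows
    for i in token_ids(compliance_words):
        embed.weight[i] *= 1.2   # step-by-step boost of compliance-linked rows

# Restoring the saved rows undoes the edit, which is the point of the approach:
# no new model files are created, only in-memory weights are touched.
```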
-
Sentenza
Splitting texts into chunks is an important step in building a good vector database. When embeddings are created for RAG systems, the size and semantic coherence of the segments directly affect the accuracy and relevance of search results. Chunks that are too short fragment the content, while chunks that are too long risk merging unrelated information, making queries less effective. The whole process also depends on the tokenizer used. Finding the right sentence boundaries is essential for applying good chunking strategies. The histogram in the figure, built from a literary text, shows the distribution of 1011 sentences, with an average length of 118.48 characters and a standard deviation of 94.49. This indicates that sentence lengths in the corpus vary widely. One limit of the current method comes from the asymmetric distribution of sentence lengths, with a long tail on the right (see graph). This means that... Read all →
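As a rough illustration of the analysis behind that histogram, the sketch below splits a text into sentences and computes the same length statistics. The regex-based splitter and the file name are assumptions; the post's actual sentence-boundary detection may differ.

```python
# Sketch: sentence-length statistics for chunking analysis.
# The naive regex splitter and the file name are assumptions for illustration.
import re
import statistics

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

# Split on ., !, ? followed by whitespace: a crude stand-in for a real
# sentence-boundary detector.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

lengths = [len(s) for s in sentences]
print(f"{len(sentences)} sentences")
print(f"mean length:   {statistics.mean(lengths):.2f} characters")
print(f"std deviation: {statistics.stdev(lengths):.2f}")

# A long right tail in this distribution suggests merging very short sentences
# and splitting very long ones before building embedding chunks.
```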
-
Hello World
This is my first post on GitHub Pages. Read all →