machinelearning - A little place on the Internet

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI ( arxiv.org )

The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts...

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks ( aclanthology.org )

Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. That a unimodal model achieves similar accuracy on a VL task to a multimodal one, indicates that so-called unimodal collapse...

Demystifying CLIP Data ( arxiv.org )

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP...

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models ( openaccess.thecvf.com )

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for...

A Long Way to Go: Investigating Length Correlations in RLHF ( arxiv.org )

Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering,...

Think before you speak: Training Language Models With Pause Tokens ( arxiv.org )

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token?...

CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No ( arxiv.org )

Out-of-distribution (OOD) detection refers to training the model on an in-distribution (ID) dataset to classify whether the input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD...

Language Modeling Is Compression ( arxiv.org )

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive...
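
A rough illustration of the first sentence above: a predictive model assigns each symbol a probability, and an entropy coder such as arithmetic coding can store the sequence in about the sum of -log2(p) bits. The sketch below only computes that ideal code length. It does not implement a coder, and the adaptive byte-frequency model with add-one smoothing is a hypothetical stand-in for a language model, not anything taken from the paper.

```python
import math
from collections import Counter


def ideal_code_length_bits(text: str) -> float:
    """Sum of -log2 p(byte | bytes seen so far) under a toy adaptive model.

    The model is an assumption for illustration: byte frequencies with
    add-one (Laplace) smoothing over the 256 possible byte values.
    """
    counts = Counter()  # how often each byte value has been seen so far
    seen = 0            # total number of bytes seen so far
    bits = 0.0
    for byte in text.encode("utf-8"):
        # Predict the next byte before looking at it, then pay its code length.
        p = (counts[byte] + 1) / (seen + 256)
        bits += -math.log2(p)
        # Update the model so later predictions adapt to the data.
        counts[byte] += 1
        seen += 1
    return bits
```

Under this toy model a repetitive string costs fewer bits than a varied string of the same length, because better predictions mean shorter codes.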

Scaling Vision-Language Models with Sparse Mixture of Experts ( arxiv.org )

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these...

Hydra-MoE: A new class of Open-Source Mixture of Experts ( github.com )

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training ( aclanthology.org )

Multilingual Vision-Language Pre-training (VLP) is a promising but challenging topic due to the lack of large-scale multilingual image-text pairs. Existing works address the problem by translating English data into other languages, which is intuitive and the generated data is usually limited in form and scale. In this paper, we...

Real-Time Radiance Field Rendering ( huggingface.co )

Achieves SOTA on quality AND on training time AND renders in real-time (60fps+)

Introducing Keras Core: Keras for TensorFlow, JAX, and PyTorch. ( keras.io )

Keras 3.0 now works with TensorFlow, JAX and PyTorch. Also introduces a bunch of new features. Check it out.

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time ( proceedings.mlr.press )

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large...
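
The idea named in the title, averaging the weights of several fine-tuned models, can be sketched as a plain element-wise mean over checkpoints. This is a minimal sketch and not the paper's code. It assumes every checkpoint shares the same architecture and parameter names. The `StateDict` type (parameter name mapped to a flat list of floats) and the function `uniform_soup` are hypothetical stand-ins for real framework checkpoints.

```python
from typing import Dict, List

# Hypothetical stand-in for a real checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def uniform_soup(state_dicts: List[StateDict]) -> StateDict:
    """Average each parameter element-wise across all given checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    names = state_dicts[0].keys()
    for sd in state_dicts:
        if sd.keys() != names:
            raise ValueError("all checkpoints must share the same parameter names")
    n = len(state_dicts)
    soup: StateDict = {}
    for name in names:
        # zip(*...) groups the i-th weight of every checkpoint together.
        columns = zip(*(sd[name] for sd in state_dicts))
        soup[name] = [sum(col) / n for col in columns]
    return soup
```

The result is a single set of weights, so inference costs the same as for one model, which matches the "without increasing inference time" claim in the title.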

NeurIPS 2023 Machine Unlearning Challenge ( unlearning-challenge.github.io )

Deep neural networks are at the center of rapid progress in AI, with applications to computer vision, natural language processing, speech recognition and others. While this progress offers many exciting opportunities, it also introduces new challenges, as we researchers bear the responsibility to understand and mitigate the...

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages ( arxiv.org )

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that the pre-training in English does not transfer...

PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news ( blog.mithrilsecurity.io )

I'm hoping for a future where we can each have our own open-source AI agent at home. Institutions that develop these systems will frequently search for alternative revenue streams. Sneaking misinformation and bias into a model may be one of them. We need ways to guard against that....

GitHub - mazzzystar/Queryable: Run CLIP on iPhone to Search Photos. ( github.com )

The open-source version of Queryable, an iOS app that runs the CLIP model on iOS to search the Photos album offline....

CoDi: Generate Anything from Anything All At Once through Composable Diffusion ( codi-gen.github.io )

Abstract:...

GitHub - PiotrNawrot/nanoT5: Fast & Simple repository for pre-training and fine-tuning T5-style models ( github.com )

This repository comprises the code to reproduce the pre-training of a "Large Language Model" (T5) under a limited budget (1xA100 GPU, < 24 hours) in PyTorch. We start from the randomly initialised T5-base-v1.1 (248M parameters) model, and we pre-train it on the English subset of the C4 dataset and then fine-tune it on...

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing ( arxiv.org )

Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and...

Voice Conversion With Just Nearest Neighbors ( arxiv.org )

TL;DR: want to convert your voice to another person's voice? Or even to a whisper? Or a dog barking? Or to any other random speech clip? Give our new method a try: https://bshall.github.io/knn-vc...

Machine Learning Beginner Info/Resources ( kbin.social )

MOOCs...

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning ( arxiv.org )

Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The...

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors ( aclanthology.org )

Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In...