ETHGlobal London: My Beginning into Blockchain

By: abhijoshi

Date: March 26, 2024, 11:19 p.m.

Tags: hackathon

After doing my first ever hackathon a week prior, an experience through which I learned so much and was able to build a really cool project (see my post about it here: Encode Club AI Hackathon), I took the plunge and did my second hackathon in two weeks, and I am very glad I did.

I had the opportunity to take part in the amazing ETHGlobal London hackathon, a premier blockchain hackathon where some of the best programmers get together and build some really cool blockchain-inspired projects. Now a single-hackathon veteran, I knew what to expect to some degree, but when I arrived at the venue on Friday evening to meet my team members and get started hacking, I was awed by just how large the event really was. Not to mention the slight nervousness that arose from seeing how many hackers were already deep into the screens of their sticker-laden laptops, a sign of their previous hacking experience. However, I was also very excited, not only to be taking part in another hackathon, but also about the opportunity before me to learn, especially about blockchain, which I did not have much experience in.

The Idea

Like last time, meeting my team definitely calmed me down and made me even more excited to hack together a cool project over the course of the weekend, not to mention potentially win some of the many bounties that were up for grabs. Together we came up with a very interesting idea: a digital marketplace for AI prompt sharing, where users could share prompts they had created to leverage some of the many generative AI tools available. The reward for sharing your prompts would be some sort of crypto token, and the benefit of using blockchain technology would be that if someone were to improve upon, or generate something from, a prompt you contributed, it would be possible to prove your ownership of the original prompt, and thus some sort of royalties could be paid out.

There were many positive points to this idea, mainly that we were using AI and blockchain in practical and useful ways, especially by allowing user contribution and a more "distributed" marketplace, which encompasses the essence of blockchain tech as a whole. However, the prompt marketplace itself seemed a bit limited in its use: first, prompts themselves are not an easily copyrightable piece of work, and second, the non-deterministic nature of generative AI means that the same prompt would not necessarily generate the same resulting image, text, etc. That made it difficult to realize the value of the "collaborative" element of the marketplace, as it is hard to build on someone else's contribution when the result of that contribution is not deterministic.

Instead, we decided to pivot a bit. I noticed that many people around us seemed to be taking pictures of the ETHGlobal event, so opening the marketplace up from just prompts to actual pictures seemed like a good idea: a "marketplace" for photos, where an event organizer could create a platform through which attendees share photos of the event they are attending. The twist that would make this a compelling idea would be the integration of blockchain and crypto tokens. The idea would be to gamify the experience of contributing to the platform by making the contribution part of a contest. Think of the following flow (sketched in code after the list):

  1. An event organizer creates a contest, where they request attendees to contribute a particular type of image. They also set up a bounty for the contest, paid out in some crypto token.
  2. Users join using their crypto wallet, which lets us simplify setting up user accounts, and then contribute to the contest.
  3. The users' contributed pictures are ranked based on relevancy to the original ask of the contest, and after the contest closes, the users are rewarded a portion of the bounty, paid directly into their wallets.
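
To make that flow a bit more concrete, here is a rough, hypothetical sketch of the data model behind it. The class and field names are purely illustrative (the real implementation lived mostly in our NodeJS backend); the point is just how a contest ties submissions, scores, and the bounty together.

# Hypothetical sketch of the contest flow above; names and fields are
# illustrative only, not taken from our actual implementation.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Submission:
    wallet_address: str   # the contributor's wallet, doubling as their identity
    image_b64: str        # the contributed photo, base64-encoded
    score: float = 0.0    # relevancy score assigned by the scoring service

@dataclass
class Contest:
    prompt: str           # the organizer's ask, e.g. "photos of the main stage"
    bounty: float         # total reward, denominated in some crypto token
    submissions: List[Submission] = field(default_factory=list)

    def payouts(self) -> Dict[str, float]:
        """Split the bounty across contributors in proportion to their scores."""
        total = sum(s.score for s in self.submissions)
        if total == 0:
            return {}
        return {s.wallet_address: self.bounty * s.score / total
                for s in self.submissions}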

With our idea set, we went about coding it all up!

Implementation Details

The main things I was responsible for in the project were parts of the backend, deployment (what a surprise), and the scoring algorithm. Let's jump into each section:

Scoring Algorithm

In order to facilitate the contest aspect of the marketplace, we needed some way of scoring a user's contributed image against the original contest ask. The way I thought to do this was by using a couple of machine learning models. In a previous project, I had a similar problem where I was comparing a user-submitted image to an original reference image. In that kind of scoring scenario, we could use more traditional comparative techniques, such as structural similarity analysis and other image feature detection algorithms, but in this project, where a contest ask would typically be a text prompt from the organizer, it made sense to take the help of machine learning to compare an image to words. The actual implementation first involved generating a caption to describe the image. To do this I made use of a pre-trained image captioning model, which creates a caption given an image. Once we had this set of words, we could use a sentence similarity model to compare the caption of the contributed image with the original contest prompt and generate a score for the image. The implementation is fairly simple, using two models from HuggingFace (thank the AI gods): one to generate a caption based on an image, and one to compare captions by turning them into vector embeddings and finding the cosine similarity between the vectors. The code for my implementation was the following:

import os
from dotenv import load_dotenv
load_dotenv()
# PATH = <model_data_path>
# os.environ['TRANSFORMERS_CACHE'] = PATH
# os.environ['HF_HOME'] = PATH
# os.environ['HF_DATASETS_CACHE'] = PATH
# os.environ['TORCH_HOME'] = PATH
import base64
from PIL import Image
from io import BytesIO 
from transformers import AutoTokenizer, AutoModel, BlipProcessor, BlipForConditionalGeneration, GPT2Tokenizer, GPT2Model, AutoProcessor, VipLlavaForConditionalGeneration
import torch
from sklearn.metrics.pairwise import cosine_similarity
from utils import logger
logger.info(f'HF Cache: {os.environ.get("HF_HOME")}')


DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#DEVICE = 'cpu'

def LlavaModel():
    model_id = "llava-hf/vip-llava-7b-hf"

    question = "What are these?"
    prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{question}###Assistant:"

    kwargs = {
        "prompt": prompt,
    }
    model = VipLlavaForConditionalGeneration.from_pretrained(
        model_id, 
        torch_dtype=torch.float16, 
        low_cpu_mem_usage=True, 
    ).to(DEVICE)

    processor = AutoProcessor.from_pretrained(model_id)
    return processor, model, kwargs

def BLIPModel():
    kwargs = {}
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")

    logger.info("Preparing image captioning mode")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
    model = model.to(DEVICE)
    return processor, model, kwargs

def caption_image(image, model, max_length=1024):
    # Processing image
    # Need to get image from API (Base64)
    logger.info("Processing image")
    image = base64.b64decode(image)
    raw_image = Image.open(BytesIO(image)).convert('RGB')

    # unconditional image captioning
    processor, model, kwargs = model()
    inputs = processor(raw_image, return_tensors="pt", **kwargs).to(DEVICE)

    out = model.generate(**inputs, max_length=max_length, max_new_tokens=1024)
    caption = processor.decode(out[0], skip_special_tokens=True)

    return caption

# Once we have caption, we need to compare it to original caption
def get_similarity_score(image_caption, event_prompt):
    def cls_pooling(model_output):
        return model_output.last_hidden_state[:,0]

    #Encode text
    def encode(texts):
        # Tokenize sentences
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

        # Compute token embeddings
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)

        # Perform pooling
        embeddings = cls_pooling(model_output)

        return embeddings

    # Load model from HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
    model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
    logger.info(f'Comparing prompt "{image_caption}" to "{event_prompt}"')
    #Encode query and docs
    docs = [event_prompt, image_caption]
    query_emb = encode(event_prompt)
    doc_emb = encode(docs)

    #Compute dot score between query and all document embeddings
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
    logger.debug(f"Scores {scores}")
    normalized_scores = [score/scores[0] for score in scores]
    logger.debug(f"Normalized scores: {normalized_scores}")
    #Combine docs & scores
    doc_score_pairs = list(zip(docs, normalized_scores))

    #Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    #Output passages & scores


    for doc, score in doc_score_pairs:
        logger.debug(f"{doc}:{score}")

    logger.info(f"Similarity score is: {doc_score_pairs[-1][1]}")
    return doc_score_pairs[-1][1]

def get_similarity_score_cosine(image_caption, event_prompt):
    def cls_pooling(model_output):
        return model_output.last_hidden_state[:,0]

    #Encode text
    def encode(texts):
        # Tokenize sentences
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

        # Compute token embeddings
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)

        # Perform pooling
        embeddings = cls_pooling(model_output)

        return embeddings

    # Load model from HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
    model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
    logger.info(f'Comparing prompt "{image_caption}" to "{event_prompt}"')
    #Encode query and docs
    docs = [event_prompt, image_caption]
    query_emb = encode(event_prompt)
    doc_emb = encode(docs)

    #Compute dot score between query and all document embeddings
    scores = []
    for i in range(doc_emb.shape[0]):
        scores.append(torch.nn.functional.cosine_similarity(query_emb, doc_emb[i], dim=1))
    logger.debug(f"Scores {scores}")

    #Combine docs & scores
    doc_score_pairs = list(zip(docs, scores))

    #Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    #Output passages & scores


    for doc, score in doc_score_pairs:
        logger.debug(f"{doc}:{score}")

    logger.info(f"Similarity score is: {doc_score_pairs[-1][1]}")
    return doc_score_pairs[-1][1][0]


def get_score(image: str, event_prompt: str, model=BLIPModel, scoring_func=get_similarity_score_cosine) -> float:
    image_caption = caption_image(image=image, model=model)
    similarity_score = scoring_func(
        image_caption=image_caption,
        event_prompt=event_prompt,
    )
    return similarity_score

def is_relevant(score, threshold=0.7):
    if score <= threshold:
        return False

    return True
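
For completeness, here is roughly how the scoring entry point above gets used, under the assumption of a local test image (the file name and prompt are just examples; in the real flow the base64 string arrives via the API):

# Rough usage sketch of get_score(); the image path and prompt are examples only.
import base64

with open("test_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

score = get_score(
    image=image_b64,
    event_prompt="people hacking at ETHGlobal London",
)
print(float(score), is_relevant(score))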

Given more time outside the constraints of the hackathon, it would make sense to fine-tune the image captioning models, or use different models for different types of events, so that they are able to pick out nuances in the images that are uploaded. For example, if we were at a football game, we would want our scoring to be more nuanced in terms of football, able to distinguish teams, set pieces, and maybe even players. Whereas, if the event were our hackathon, we may want to distinguish between sponsors, things happening on stage, or other such details. This would make the scoring significantly more fine-grained and intuitive.
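
As a rough illustration of how that could slot into the existing code: the event-type keys below are made up, and a real football-specific model would be a separately fine-tuned captioner (BLIPModel just stands in as a placeholder here), but the existing get_score() already accepts a model factory.

# Hypothetical sketch: select a captioning model factory per event type and
# pass it into the existing get_score(). The keys are illustrative, and
# BLIPModel is only a placeholder for a domain-specific fine-tuned model.
EVENT_MODELS = {
    "hackathon": BLIPModel,   # general-purpose captioning, as used above
    "football": BLIPModel,    # placeholder for a sports-fine-tuned captioner
}

def get_score_for_event(image: str, event_prompt: str, event_type: str) -> float:
    model_factory = EVENT_MODELS.get(event_type, BLIPModel)
    return get_score(image=image, event_prompt=event_prompt, model=model_factory)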

Deployment and Packaging

The rest of the backend in our project was written in NodeJS so as to more easily integrate with our frontend. This posed some challenges around how we would call our scoring algorithm, written in Python, from NodeJS. We settled on using an API to call the Python code, as API calls were already going to be made to facilitate communication with the frontend, so this would not be significant extra work to integrate into the backend. From the Python side, I quickly coded up a FastAPI endpoint that could be called to trigger the scoring algorithm. In addition, I containerized the Python code, the NodeJS backend, and the frontend so that they could be deployed as "microservices" communicating with each other over API calls.
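
The endpoint itself boiled down to something like the following simplified sketch; the request field names and the "scoring" module name are illustrative rather than the exact schema we shipped:

# Simplified sketch of a FastAPI scoring endpoint; field and module names are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

from scoring import get_score, is_relevant  # hypothetical module name for the code above

app = FastAPI()

class ScoreRequest(BaseModel):
    image: str          # base64-encoded image sent by the NodeJS backend
    event_prompt: str   # the organizer's contest prompt

@app.post("/score")
def score(request: ScoreRequest):
    s = float(get_score(image=request.image, event_prompt=request.event_prompt))
    return {"score": s, "relevant": is_relevant(s)}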

Result

Overall, the hackathon was a great experience. Beyond the fact that our project was runner-up in two bounties and also snagged a couple of pool prizes, resulting in total winnings of around 1.2k USD (which I was pretty happy about), I was also able to meet some really cool people and, more importantly, was exposed to the huge world of crypto, an area that I will definitely delve deeper into in the near future!

Our full prize breakdown was as follows:

  1. Filecoin - Build Data Economies & Tools Together with Filecoin: Runners Up
  2. Nethermind - Best Web3 Social: 2nd place
  3. Arbitrum - Pool Prize
  4. Chiliz - Pool Prize

Furthermore, you can check out our project here: Momentor