GSoC 2021 — Biweekly Report 2

Anubhav Sharma
3 min read · Jun 29, 2021


This report consists of two parts. The first covers the last week of the community bonding period, and the second covers week one of the coding period.

The last week of community bonding gave me deeper insight into how my work should function and how my approaches should take shape. Towards the start of the first week, we had a Wikimedia-ML team meeting in which the idea of language-agnostic approaches was introduced.

[Team meeting screenshot — I am in the top-left corner]

Despite my experience in the NLP field, the idea required me to do my own research. I decided to break the problem down into sections:

  1. Fitting a single model for the entire language set is a difficult problem to deal with, so instead I will divide the languages into groups on the basis of language families. This way the task can be handled more easily.
  2. The language families I decided to start with are the Indian ones, i.e. the Dravidian languages and the Indo-Aryan languages.
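As an illustration, the family-based grouping described above can be sketched as a simple lookup table. The language codes and family assignments below are my own illustrative choices, not a definitive or exhaustive list:

```python
# Hypothetical sketch: route each wiki language code to a language family,
# so one model can be fitted per family instead of per individual language.
# The code-to-family assignments are illustrative, not exhaustive.
LANGUAGE_FAMILIES = {
    "dravidian": ["ta", "te", "kn", "ml"],          # Tamil, Telugu, Kannada, Malayalam
    "indo_aryan": ["hi", "bn", "mr", "gu", "pa"],   # Hindi, Bengali, Marathi, Gujarati, Punjabi
}

# Invert the mapping once for fast per-language lookup.
FAMILY_OF = {
    lang: family
    for family, langs in LANGUAGE_FAMILIES.items()
    for lang in langs
}

def family_for(lang_code: str) -> str:
    """Return the family a wiki language code belongs to, or 'other'."""
    return FAMILY_OF.get(lang_code, "other")
```

With a lookup like this, training data can be bucketed by `family_for(code)` before fitting one model per bucket, e.g. `family_for("ta")` returns `"dravidian"`.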

Now the coding period has begun, and in its first week I had to look for approaches to deal with the issue of language-agnostic models.

During that week, I had a meeting with Isaac Johnson (one of the few people who worked on ORES and moved on to neural approaches for the task of topic classification). I tried to get his insights on how I should proceed, since the task had no clear boundaries. He worked on the articletopic field and suggested that I begin with a task that has not yet been explored, so I picked edit quality as my task. I then read multiple approaches regarding the multilingual aspect:

  1. MuRIL: Multilingual Representations for Indian Languages — This paper introduces MuRIL, a multilingual LM built specifically for Indian languages. It is trained on significantly large amounts of Indian text corpora, augmenting monolingual text with both translated and transliterated document pairs, and reportedly outperforms mBERT for Indian languages.
  2. How Multilingual is Multilingual BERT? — This paper examines how mBERT works and how it performs. I believe in doing proper research on an approach before trying it, which is why I analysed this paper in detail.
  3. Language-agnostic Topic Classification for Wikipedia — This is the approach Isaac shared with me. While the work is not directly related to my task, it is informative about the data, which is itself a tricky part of building a model. Hence this approach too was necessary to go through.

After going through multiple resources, the next task is to extract data by digging into the nitty-gritty of the existing data-retrieval mechanism. Hence the next two weeks will be devoted to data fetching.
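The data-fetching step might start from the standard MediaWiki Action API. As a rough sketch (the endpoint and query parameters below are the standard MediaWiki ones, but which fields edit-quality modelling will actually need is still undecided), a request for recent changes on a given wiki could be constructed like this:

```python
from urllib.parse import urlencode

# Sketch only: build a MediaWiki API query URL for recent changes on one wiki.
# The exact revision fields needed for edit-quality modelling are still TBD.
def recent_changes_url(lang_code: str, limit: int = 50) -> str:
    base = f"https://{lang_code}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|user|timestamp",  # per-change revision metadata
        "rclimit": limit,
        "format": "json",
    }
    return f"{base}?{urlencode(params)}"
```

Iterating this per language code (grouped by family) would give a starting point for collecting raw edits, e.g. `recent_changes_url("ta")` targets the Tamil Wikipedia.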

The reports and approach documents will be pushed to the respective GitHub repository.

For any doubts, feel free to ping me at anubhav.sharma@research.iiit.ac.in or on Zulip.
