Levi Briggs Data Science Blog: February 2020

Wednesday, February 19, 2020

Stupid or Solid?

Reading both of these articles has been insightful for me and beneficial for my professional development. As a Data Science major, I have gone through a curriculum that goes to great lengths to expose us to data-gathering techniques and to what can be done with large data sets. There is not much focus on writing SOLID code or on the many factors at play when trying to produce code that everyone can maintain. The Data Science curriculum is good at showing students how to manage the vast amount of data that is out there, and how to extract meaningful information and make predictions or decisions from it. We mostly use existing Python libraries and machine learning tools such as scikit-learn and Pandas. My first two programming classes gave me some good experience in writing neat and readable code; however, the main focus of my degree was data management and finding insights in large data sets. These articles opened up a new world for me and showed me that writing good code is as much an art form as it is a technical skill. A great amount of effort goes into writing SOLID code, as the article describes. One must be mindful of the people who will have to read and maintain the code down the road. It is not enough to just write functioning code; a good developer writes code that their team, and possibly other people around the world, will be able to comprehend and add to. One important concept that I was not aware of before reading these articles is that when writing classes, you want high cohesion and low coupling. That means you want to keep code that is related in function together, but you do not want to design your classes in such a way that many of them depend on one another. The goal is to have your code work towards a common goal while keeping its different parts independent where possible. I also learned that it is not ideal to optimize your code prematurely, before it is even a working product.
Working code is far better than optimized code that does not do what it is intended to do. Both articles exposed me to concepts that I was not fully aware of. These software engineering principles will help me in my career and make me a more versatile data scientist. It is always a good idea to continue your education and broaden your horizons, especially in the fast-growing and constantly changing field of technology and software.
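
To make the idea of high cohesion and low coupling a bit more concrete, here is a minimal sketch in Python. It is not taken from either article; the class names `CsvColumnReader` and `SummaryStats` and the sample data are hypothetical, chosen only for illustration. Each class does one related job, and the summarizer accepts any iterable of numbers rather than depending on the reader.

```python
import csv
import io
from statistics import mean
from typing import Iterable, List


class CsvColumnReader:
    """Hypothetical class. Cohesive: it only knows how to read one numeric column."""

    def __init__(self, column: str) -> None:
        self.column = column

    def read(self, text: str) -> List[float]:
        # Parse CSV text and return the chosen column as floats.
        rows = csv.DictReader(io.StringIO(text))
        return [float(row[self.column]) for row in rows]


class SummaryStats:
    """Hypothetical class. Cohesive: it only knows how to summarize numbers."""

    def summarize(self, values: Iterable[float]) -> dict:
        # Loosely coupled: any iterable of numbers works, not just reader output.
        data = list(values)
        return {"count": len(data), "mean": mean(data), "max": max(data)}


# Made-up sample data for illustration.
raw = "height\n1.5\n2.5\n3.5\n"
values = CsvColumnReader("height").read(raw)
stats = SummaryStats().summarize(values)
print(stats)
```

Because `SummaryStats` never refers to `CsvColumnReader`, either class could be replaced or tested on its own, which is roughly the independence the principle is after.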

Thursday, February 13, 2020

What's Happening?

There is a wide variety of interesting technology articles in this Association for Computing Machinery magazine. One that piqued my interest and is relevant to my field was the Computing Ethics article "Engaging the Ethics of Data Science in Practice." The authors seek more common ground between data scientists and their critics, and discuss the issues that arise from the growing field of Data Science and its practitioners. They explain that there is critical commentary on the field, and that these critics claim data scientists do not recognize the power they wield and often use that power in a reckless and unethical manner. These critiques are not new and are not based in much truth. There are some instances of Data Scientists and their firms abusing their analytical powers, such as the Cambridge Analytica and Facebook controversies, but as a whole, Data Science is no more unethical than other computer science fields. It is the personal morals and end goals of specific people that lead to unethical situations. These accusations are based in ignorance of the field, and they overlook the routine, deliberate activities that already deal with what these outsiders have in mind when it comes to ethics. Solon Barocas and Danah Boyd, the authors of this article, provide examples of Data Scientists practicing ethics, much like practitioners in many other fields. They explain that Data Scientists engage in countless acts of implicit ethical deliberation while creating a meaningful machine learning model. Data Scientists have to deal with incomplete data, which the authors argue is as much a moral concern as a practical one. Choosing what data to use, and determining whether it is useful based on where it came from, is a common situation data scientists find themselves in. Validating a model and anticipating how it will perform when deployed are also ethical concerns that are often overlooked by the outside community.
There is a great need for careful judgement in this field; practitioners often have to consider the ramifications their work will have for humanity and how it will ultimately affect the world. Even when attempting to address these ethical issues explicitly, practitioners face trade-offs that must be considered. The article then describes a model with gender bias, where fixing the issue would mean sacrificing privacy. The authors want a collaborative and constructive dialogue between Data Science practitioners and their critics. They want the critics to recognize the effort put in by these people, and the small ethical decisions that go into their analyses. Commentary on the field is often written by people who are unaware of its actual practice. The authors argue that we need to make an effort to work collectively and deliberate appropriately about the field, which will reveal common ground between the two groups and narrow the gap in understanding.

Monday, February 10, 2020

This Bugs Me

After switching our focus to the Pandas library, I came across a bug that has a healthy number of comments and contributions. The specific problem is that slicing a data frame with float indices causes an issue, while slicing with integers appears to work fine. There is a great deal of discussion on this bug, as it dates back to 2014. It seems to be one of the more prominent issues in the Pandas repository, and there is still recent activity in the thread. This bug is something our team could look further into and possibly resolve. Some contributors argued that this is a debatable issue, and that the behavior might not be completely problematic. Other entries suggest that the issue should be clarified further and moved to a more specific thread. My theory as to why this bug has not been resolved is that it is too general, and the community is divided on whether or not it constitutes an issue. One person explained that this bug might exist because of old syntax and outdated practices. There is still much debate over whether this bug has been resolved. I was personally unable to reproduce it because my development environment is not set up adequately. Although there appears to be no official "bug tracker", there is an active community of users reporting issues with the software and developers discussing those issues and whether they are worth looking into. I am new to the world of bug fixing and triaging, and it is something that I need to develop skills in.
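
For readers unfamiliar with the distinction, here is a minimal sketch of how label-based and positional slicing behave on float and integer indices. It is not a reproduction of the reported bug, and it does not come from the issue thread; the data frames and their values are made up for illustration, and it only uses `.loc` and `.iloc`.

```python
import pandas as pd

# Made-up data for illustration; not taken from the issue thread.
float_df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[0.5, 1.5, 2.5, 3.5])
int_df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[0, 1, 2, 3])

# .loc slices by label and includes both endpoints. On a sorted index,
# the slice bounds do not have to be labels that actually exist.
float_by_label = float_df.loc[1.0:3.0]   # keeps labels 1.5 and 2.5
int_by_label = int_df.loc[1:2]           # keeps labels 1 and 2

# .iloc slices by position and excludes the stop position,
# regardless of what kind of index the frame has.
float_by_position = float_df.iloc[1:3]   # keeps the 2nd and 3rd rows

print(float_by_label)
print(int_by_label)
print(float_by_position)
```

Using `.loc` or `.iloc` explicitly makes it clear whether a slice is meant as labels or as positions, which is where float and integer indices can otherwise be read differently.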
