Levi Briggs Data Science Blog: 2020

Sunday, April 12, 2020

Meeting Charleston

Due to the ongoing COVID-19 pandemic, I was unable to attend a physical meeting to fulfill the requirements for this blog post. I was originally planning on going to either the Data Science meetup or the Linux Users Group. Being the avid Linux user that I am, I was disappointed that I could not attend the meeting and potentially learn interesting things about the Linux operating system. I am also disappointed to have missed the networking opportunity that the Data Science meetup could have given me. This unforeseen pandemic has definitely disrupted the end of this semester and presented me with various challenges that I had to overcome. Although I was not able to attend an event this year, I do have a past experience with the Linux Users Group. When I was a child, my father took me to one of the meetings while he was attending classes at the College of Charleston. Although I do not remember much from the meeting, I will say that it sparked my interest in Linux. The experience is one of my earliest memories of the operating system, and it had a great influence on me at a formative age. The meeting showed me the passion of the open source community and how such people find joy in open source computing.

Wednesday, April 8, 2020

Chapter 9

With this project coming to a close, I am grateful for all of the concepts and skills that I have learned from this class. Although it was not the capstone that I was expecting, being responsible for contributing to an open source project has proved to be widely beneficial for my professional development. This class has broadened my horizons and shown me the diverse world of open source development. I find beauty in this community-driven software development. It exemplifies a passionate community that is looking out for the good of its people. They produce and maintain good software that will, in turn, be used to create more good software and products that everyone can use. It is a beautiful cycle of code production that anyone can be involved with. Familiarity with this type of software development process, although not conventionally related to data science, is still something every prospective tech professional should have. I am glad that I was required to be exposed to this different curriculum, as I feel that it has given me invaluable experience for my career in the industry. Data science is not just about finding meaning in large data sets and performing analytical tasks; a data scientist who is familiar with software development can prove to be an irreplaceable team member and a great asset to the community. Chapter 9 outlines the end of a software project and explains various concepts that ensure a smooth transition to the client. This chapter provides meaningful information for the hand-off and discusses the different choices a developer has for keeping the product working for the client for years to come. There are many options, such as transitioning to professional support or hosting the code on a client's server relatively cheaply.

Tuesday, March 24, 2020

Chapter 6

The database is arguably the most essential part of many software projects. It is where the bulk of the important data from users is stored so that it can be reused when needed. Part of the data science curriculum at the College of Charleston is centered on databases and the management of large data sets. With that being said, I have some experience with SQL queries and other database techniques from the DATA 210 class. The command line is a useful tool when querying data sets and trying to find trends in data. One interesting thing about databases is that the information stored inside "persists", or is able to exceed the lifetime of the actual program. The data can still be used and modified even if the program is no longer viable. Databases are standalone systems that work alongside programs to serve as data storage. One is able to store and organize vast amounts of data with relatively simple commands. One such form of organization is the table, which has a fixed number of columns and a varying number of rows. An important concept to consider when implementing tables is the idea of normalization. Normalization structures tables so that each piece of information is stored in only one place, which reduces redundancy and keeps the results of queries consistent. Tables can be queried quickly because each row has its own key, which is referenced when searching. This chapter has served as a good review of database management and the specific queries that we learned in DATA 210 at the College of Charleston. I have had many conversations with my father about the great importance of databases and how a company will pay good money to have a database specialist. This is something that I have considered as a career, given the need for more database professionals in the field. I feel that data science and expertise in database management go hand in hand, and that a data scientist with that expertise would be an essential part of any team.
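As a rough illustration of the table, key, and normalization ideas above, here is a minimal Python sketch using the standard library's sqlite3 module. The table names, column names, and sample rows are made up for the example and do not come from the chapter or from DATA 210 materials.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with two normalized tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Each student is stored exactly once and identified by a key.
        CREATE TABLE students (
            student_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        -- Enrollments point at students by key instead of repeating names.
        CREATE TABLE enrollments (
            enrollment_id INTEGER PRIMARY KEY,
            student_id    INTEGER NOT NULL REFERENCES students(student_id),
            course        TEXT NOT NULL
        );
        """
    )
    # Hypothetical sample data, used only for illustration.
    conn.executemany(
        "INSERT INTO students (student_id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO enrollments (student_id, course) VALUES (?, ?)",
        [(1, "DATA 210"), (1, "MATH 101"), (2, "DATA 210")],
    )
    conn.commit()
    return conn


def courses_for(conn, student_id):
    """Return the sorted course list for one student, looked up by key."""
    rows = conn.execute(
        """
        SELECT e.course
        FROM enrollments AS e
        JOIN students AS s ON s.student_id = e.student_id
        WHERE s.student_id = ?
        ORDER BY e.course
        """,
        (student_id,),
    ).fetchall()
    return [course for (course,) in rows]


if __name__ == "__main__":
    connection = build_demo_db()
    print(courses_for(connection, 1))
```

Passing a file path instead of ":memory:" to sqlite3.connect would write the tables to disk, which is the "persistence" described above: the data would outlive the program that created it.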

Monday, March 23, 2020

Chapter 5

Domain classes are a vital part of any software project and should be considered carefully during the software development process. This object-oriented approach to client solutions is generally what makes larger projects more manageable, and often what makes them possible at all. It is always a good idea to break your program up into classes that contain the instance variables and functions relevant to each class, and also to implement some kind of inheritance hierarchy. There are two approaches to coding domain classes: reusing legacy code, which has already been established and tested by other developers, or the more labor-intensive alternative of starting from scratch. Downloading and modifying previously written code seems like the simplest and safest option, but it can prove difficult if those classes do not fit your specific needs. That is the job of the developer: to determine what the client's needs are and to best serve them with the tools at hand. This chapter has been a good refresher for me on classes in a project and their specific attributes. Another big concept in this chapter is unit testing and maintaining an effective testing strategy for your code. I have the most experience with testing throughout the process, or in other words, testing each function and piece of code before moving on. Building on this, the chapter introduced the idea of test-driven development, in which the tests are written before the code they exercise. This second approach seems to be the most rigorous and time-consuming, but it allows for the most success. Test suites, which are collections of unit tests, play a pivotal role in the testing process. They allow specific modules in a program to be tested individually to determine whether everything is working as it should.
This chapter has provided me with a beneficial review of classes and unit testing, and it introduced me to new concepts and specific frameworks that continue to build my skills as a software developer.
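To make the ideas of domain classes, inheritance, unit tests, and test suites a little more concrete, here is a minimal Python sketch built on the standard unittest module. The Person and Student classes are hypothetical stand-ins invented for the example; they are not the classes or the testing framework used in the chapter.

```python
import unittest


class Person:
    """A small domain class holding the data common to every person."""

    def __init__(self, first_name, last_name):
        if not first_name or not last_name:
            raise ValueError("both names are required")
        self.first_name = first_name
        self.last_name = last_name

    def full_name(self):
        return f"{self.first_name} {self.last_name}"


class Student(Person):
    """A subclass that inherits from Person and adds a major."""

    def __init__(self, first_name, last_name, major):
        super().__init__(first_name, last_name)
        self.major = major

    def describe(self):
        return f"{self.full_name()} ({self.major})"


class StudentTests(unittest.TestCase):
    """Unit tests: each method checks one small piece of behavior."""

    def test_full_name_is_inherited(self):
        self.assertEqual(Student("Ada", "Lovelace", "Math").full_name(), "Ada Lovelace")

    def test_describe_includes_major(self):
        self.assertEqual(Student("Ada", "Lovelace", "Math").describe(), "Ada Lovelace (Math)")

    def test_empty_name_is_rejected(self):
        with self.assertRaises(ValueError):
            Student("", "Lovelace", "Math")


def build_suite():
    """Collect the unit tests above into a single test suite."""
    return unittest.defaultTestLoader.loadTestsFromTestCase(StudentTests)


if __name__ == "__main__":
    unittest.TextTestRunner(verbosity=2).run(build_suite())
```

In a test-driven workflow, the StudentTests class would be written first and would fail until Person and Student were filled in.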

Monday, March 9, 2020

Release early and often

Proper documentation is an extremely vital part of good software. The longevity and viability of software depend greatly on good documentation that future contributors can read and fully comprehend in order to better maintain and alter the code. As chapter 8 says, in a perfect world there would be no need for documentation: every individual would have the same coding conventions and would be able to understand any piece of code written by anybody. Sadly, that is not the case. Everybody has their own way of doing things; naming conventions, spacing, and indentation are a few of the areas that allow for stylistic independence and expression (in Java specifically). Although many are taught to name their variables and methods with clarity and obvious purpose, time constraints or other factors sometimes prevent developers from doing so. There is legacy code out there with short, simple variable or method names chosen to save time and get the code working. The idea of getting the code to work first and cleaning it up later is prevalent in the field, but many projects, such as government contracts, do not allow for the cleanup to occur. These contracting jobs are in the market for code that does what they need it to do; no government project will pay a team to go back and clean up the code to make it more readable. Harsh deadlines and requirements make it hard for developers to take the time to write readable code, since they are just looking to meet the deadlines and complete the sprints on time. Although this is the reality of some projects, it is still imperative that developers take that extra time to document their code, making the lives of future maintainers, or of the developers themselves, much easier in the long run. Proper documentation practices can go a long way in this field, and many people will be greatly appreciative of them.
One must write code that is relatively easy to understand or provide documentation that adequately describes the purpose of the software. Developers should maintain good developer documentation in the code base, ensuring that every function and module can be understood in the future. Technical writing is the same as any other kind of writing: know your audience and be as clear and concise as possible.

Wednesday, February 19, 2020

Stupid or Solid?

Reading both of these articles has been greatly insightful for me and beneficial for my professional development. As a Data Science major, I have found that the curriculum goes to great lengths to expose us to data gathering techniques and to what can be done with large data sets. There is not much focus on writing SOLID code and the many factors that are at play when trying to produce code that everyone can maintain. The Data Science curriculum is good at showing students how to manage the vast amount of data that is out there, and how to extract meaningful information and make predictions or decisions from it. We mostly use existing Python libraries and machine learning tools such as scikit-learn or Pandas. My first two programming classes gave me some experience in writing neat and readable code; however, the main focus of my degree was data management and finding insights in large data sets. These articles opened up a new world for me and showed me that writing good code is as much an art form as it is a technical skill. A great amount of effort goes into writing SOLID code, as the article describes. One must be aware and mindful of the people who will have to read and maintain the code down the road. It is not enough to just write functioning code; a good developer writes code that their team, and possibly other people in the world, will be able to comprehend and add to. One important concept that I was not aware of before reading these articles is that when writing classes, you want high cohesion and low coupling. This means that you want to keep code together that is related in function, but you do not want to design your classes in such a way that many of them depend on one another. The goal is to have your code work towards a common purpose while keeping the different parts independent where possible. I also learned that it is not ideal to prematurely optimize your code before it is even a working product.
Working code is far better than optimized code that does not do what it is intended to. Both articles exposed me to concepts that I was not completely aware of. These software engineering principles will assist me in my career and make me a more versatile data scientist. It is always a good idea to continue your education and broaden your horizons especially in the fast growing and constantly changing field of technology and software.
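One way to picture high cohesion and low coupling is the small Python sketch below. It is only an illustration under assumed names: the ScoreStats and ReportFormatter classes and the build_report function are hypothetical and do not come from either article.

```python
from statistics import mean


class ScoreStats:
    """Cohesive: everything in this class is about summarizing scores."""

    def __init__(self, scores):
        self.scores = list(scores)

    def average(self):
        return mean(self.scores)

    def spread(self):
        return max(self.scores) - min(self.scores)


class ReportFormatter:
    """Loosely coupled: it needs two numbers, not the ScoreStats class."""

    def render(self, average, spread):
        return f"average={average:.1f}, spread={spread:.1f}"


def build_report(scores):
    """Wire the two independent pieces together in one place."""
    stats = ScoreStats(scores)
    return ReportFormatter().render(stats.average(), stats.spread())


if __name__ == "__main__":
    print(build_report([70, 80, 90]))
```

Because ReportFormatter never touches ScoreStats directly, either class could be rewritten or replaced without changing the other; only build_report knows about both.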

Thursday, February 13, 2020

What's Happening?

There is a wide variety of interesting technology articles in this Association for Computing Machinery magazine. One that piqued my interest and is relevant to my field was the Computing Ethics article, Engaging the Ethics of Data Science in Practice. The authors wish to find more common ground between data scientists and their critics, and to discuss the possible issues that arise from the growing field of Data Science and its practitioners. They explain that there exists critical commentary on the field, and that these critics proclaim that data scientists do not recognize the power they wield and often use such power in a reckless and unethical manner. These critiques are not new and are not based in much truth. There are some instances of Data Scientists and their firms abusing their analytical powers, such as the Cambridge Analytica and Facebook controversies, but as a whole, Data Science is no more unethical than other computer science fields. It is the personal morals and end goals of specific people that lead to possible unethical situations. These accusations are based in ignorance of the field, and they overlook the routine, deliberate activities in which practitioners already weigh the kinds of ethical questions these outsiders have in mind. Solon Barocas and Danah Boyd, the authors of this article, provide examples of Data Scientists practicing ethics, much like practitioners in many other fields. They explain that Data Scientists engage in countless acts of implicit ethical deliberation while in the process of creating a meaningful machine learning model. Data Scientists have to deal with incomplete data, which the authors argue is as much a moral concern as it is a practical one. Choosing what data to use, and determining whether it is useful based on where it came from, is a common situation data scientists find themselves in. Validating a model and anticipating how it will perform when deployed are also ethical concerns that are often overlooked by the outside community.
There is a great need for careful judgment in this field; practitioners often have to consider the ramifications their work will have for humanity and how it will ultimately affect the world. Even when attempting to address these ethical issues explicitly, practitioners face trade-offs that must be considered. The article then describes a model with gender bias, where fixing the issue would mean sacrificing privacy. The authors want a collaborative and constructive dialogue between Data Science practitioners and their critics. They want the critics to recognize the effort put in by these people and the small ethical decisions that go into making their analyses. The commentary on the field is often created by people who are unaware of the actual practice of it. The authors argue that we need to make an effort to work collectively and deliberate appropriately about the field, which will reveal common ground between the two groups and lessen the gap in understanding.

Monday, February 10, 2020

This Bugs Me

After switching our focus to the Pandas library, I came across a bug that has a healthy number of comments and contributions. The specific problem is that there is an issue with slicing float indices using a data frame, but there appears to be no issue with slicing integers. There is a great amount of discussion on this specific bug, as it dates back to 2014. It seems to be one of the more prominent issues in the Pandas repository, and there is still recent activity in the thread. This bug is something our team could look further into and possibly resolve. Some contributors argued that this is a debatable issue and that the behavior might not be completely problematic. Other entries suggest that the issue should be clarified further and moved to a more specific thread. My theory as to why this bug has not been resolved is that it is too general, and the community is divided on whether or not it constitutes an issue. One person explained that this bug might exist because of old syntax and outdated practices. There is still much debate about whether this bug has been resolved. I personally was unable to reproduce it because my development environment is not set up adequately. Although there appears to be no official "bug tracker", there is an active community of users reporting issues with the software, and of developers discussing the issues and whether they are worth looking into. I am new to the world of bug fixing and triaging, and it is something that I need to develop skills in.

Wednesday, January 29, 2020

Reflections on Open Source in Today's World

After reading the article by Jason Evangelho, I have come to the realization that many people share the same experiences and frustrations when it comes to the Windows operating system. The author of this article details his complicated relationship with Windows and explains his eventual switch to the open source world of Linux. One specific anecdote that resonated with me was how Evangelho would lose important work progress and file transfers due to the inopportune timing of Windows updates. I have personally been affected by these badly timed updates many times, and I eventually switched to Linux just as the author did. He went on to explain how he was astounded by the responsiveness of Linux on older machines and how it almost ran better than his new Windows machine. Evangelho's interest in Linux was originally sparked by the release of Steam Machines from Valve and the access they provided in terms of gaming. Gaming on PCs was almost exclusive to Windows' closed environment until the release of these machines and Valve's drive for open source gaming access. I have been a strong supporter of Linux, specifically Ubuntu, for over 10 years now. I agree with the author that the freedom and stability these distributions give you are unparalleled, and are what make Linux far superior to Windows. The updates are not forced, and they do not seem to break some feature with every iteration as Windows updates do. The second article I read had to do with the Linux terminal and the many tricks and shortcuts one can employ to maximize their efficiency. For example, Ctrl + L will clear the screen without having to type out "clear", and sudo !! will run the previous command again with administrative sudo privileges. Another useful shortcut I learned was grep -Ev '^#|^$' <file>, which will display the file's content without comment lines or empty lines. The bash shell that Linux operating systems use can prove quite useful when coding or doing data science work.
Part of the reason I recently installed Ubuntu back onto my computer was the native inclusion of git in the terminal. The bash shell and its commands are far more standard, and are more widely used in the industry, than those of the Windows command line. Although the two articles are not similar in content, they are both examples of why I support Linux and its many open source endeavors. From the responsiveness of the desktop and its GUIs, to the freedom of customization, to the useful nature of the terminal, one can never go wrong with replacing Windows with Linux as their main operating system. Gone are the days when Windows was the only OS that supported various commercial applications; I will always be the one to recommend Linux and its many diverse distributions. There is a "flavor" for everyone, one that will suit their needs and personal aesthetic tastes.
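For anyone more comfortable in Python than in the shell, the grep -Ev '^#|^$' filter mentioned above can be approximated with a few lines of code. This is a minimal sketch; the function name and the sample text are made up for the example.

```python
import re

# Same pattern as the grep command: a line that starts with "#" or is empty.
SKIP = re.compile(r"^#|^$")


def strip_comments_and_blanks(text):
    """Return the lines of text that grep -Ev '^#|^$' would keep."""
    return [line for line in text.splitlines() if not SKIP.search(line)]


if __name__ == "__main__":
    sample = "# settings\n\nname=demo\n  # indented comment\nlevel=3\n"
    for line in strip_comments_and_blanks(sample):
        print(line)
```

Like the grep version, this only drops lines that begin with "#" in the first column or are completely empty, so an indented comment or a line containing only spaces is kept.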

Wednesday, January 22, 2020

Reflections on FOSS

I have always been a strong proponent of Linux and the use of open source software. Since middle school, I was always the "odd one" not using Windows and its proprietary software. I would always have to find some workaround to complete my work, but it was always worth it for the feeling of satisfaction and the freedom in using free and open source software. Open source software is a great collaborative effort by a community of people who are driven by a passion for their work and an interest in helping the workflow of humanity. It is not motivated by financial profit or esteem; it is an effort to make quality code for anyone to use and to contribute to. There is something so beautiful and amazing about such an unorthodox way of producing software. There is no structured hierarchy of top programmers who dictate the direction and release of such software. It is open to anybody to contribute and maintain. The freedom that comes with open source software is refreshing. One just has to download the source code and alter whatever they want about a particular distribution of Linux, or some piece of software. They could make such changes without altering the original and could release the result as some other distribution. This collaborative community effort for designing and maintaining software is an astounding testament to the things we can achieve as a collective. This 'bazaar style' of coding is in stark contrast to the commercial production of code, one with strict deadlines and an established hierarchy of bosses demanding these deadlines be met. There is a great feeling of relief when contributing to these FOSS projects. It is a great sight when there is a healthy community of people contributing to something with passion and excitement. The open source community almost feels as though it is a gathering of like-minded people pursuing their passions for technology and the engineering of software.
Although the process seems to be lacking in structure and organization, there are many examples to the contrary. Take for instance the various popular Linux distributions such as Ubuntu or Fedora; there are regular releases put out by a team of developers. These releases are organized efforts, but the community can also contribute to them. A beautiful cathedral can be constructed through the hard work and dedication of a passionate community. This project has shown me the other side of this process. I am not just passively using this software anymore. I have been equipped with the tools needed to meaningfully contribute to, in my case, scikit-learn. There are many reported bugs and problems that I have been perusing on the GitHub page. There is no shortage of work to be done for this machine learning library, and the issue tracker page for scikit-learn has appropriate labels for the types of problems and their respective difficulties.

Tuesday, January 14, 2020

Introduction

My name is Levi Briggs and I am from Charleston, South Carolina. I am a Data Science major with a cognate in Psychology. Upon graduation, I hope to find a job as a data analyst or in some other related data science role. I want to analyze sports or psychology data, and I hope to use the data science techniques that I have learned in this program to improve those fields.