Pitfalls of Data Science
Over the past few years I have dipped in and out of data science and bioinformatics - most of it self-taught out of necessity. I've generated a multitude of genomics data from various sample types and library prep workflows, all of which needed analysis. The need to make sense of all that data is what ignited my love for Linux, informatics, and coding. The tech realm is vast and no one can learn everything. In some respects I was stronger in the past, simply because I was immersed in it: about 5-6 years ago I was building command-line pipelines, getting better with Docker, and solving some really interesting problems for my labmates. I'm now back in a more data-centric role and have already seen myself make strides in these areas again.
Across all of the tutorials, course materials, and classes I have taken on coding and data science, one of the most important aspects of this field goes unmentioned: dealing with structured data. One of the first things data scientists should learn is how to actually look at data in json and xml formats. I don't understand why this is not even addressed - we deal with these data structures constantly, yet I have never seen them covered in course materials unless I went hunting for them.
xPaths for xml - xml.etree
Let's start with xml. I'm going to use the example xml from w3schools to demonstrate.
First let's import os and ElementTree, then load in the xml (saved as a file).
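The original listing for this step did not survive, so here is a minimal sketch of what it might look like. The small catalog document, its tag names, and the load_root helper are all stand-ins I made up for illustration; they are not the w3schools file. The sample is written to a temporary file first so that the parsing step mirrors loading xml from disk.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; tag names and values are assumptions.
SAMPLE_XML = """<catalog>
  <item id="a1">
    <name>Widget</name>
    <price>4.50</price>
  </item>
  <item id="b2">
    <name>Gadget</name>
    <price>12.00</price>
  </item>
</catalog>
"""


def load_root(path):
    """Parse an XML file and return its root element (illustrative helper)."""
    return ET.parse(path).getroot()


# Save the sample to disk so the example reads it back like a real file.
tmp_dir = tempfile.mkdtemp()
xml_path = os.path.join(tmp_dir, "sample.xml")
with open(xml_path, "w", encoding="utf-8") as handle:
    handle.write(SAMPLE_XML)

root = load_root(xml_path)

# The root tag and each direct child's tag and attributes show the top-level layout.
print(root.tag)
for child in root:
    print(child.tag, child.attrib)
```

Looping over root only visits its direct children, which is usually enough to get a first feel for how a document is laid out.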
Output:
xPaths can also behave like wildcards: with ".//itemname", the specified tag will be found throughout the entire document, regardless of where it sits in the hierarchy. The 'find' method can also be used to locate a tag; it returns the first match. Once a tag is found, the text/data within it can be read using .text. Expanding the loop above:
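Here is a hedged sketch of that expanded loop. The catalog document is again a made-up stand-in, this time with a nested supplier tag added so that the difference between ".//name" and a plain find("name") is visible. It is parsed from a string with ET.fromstring only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; tag names and values are assumptions.
SAMPLE_XML = """<catalog>
  <item id="a1">
    <name>Widget</name>
    <price>4.50</price>
    <supplier>
      <name>Acme</name>
    </supplier>
  </item>
  <item id="b2">
    <name>Gadget</name>
    <price>12.00</price>
  </item>
</catalog>
"""

root = ET.fromstring(SAMPLE_XML)

# ".//name" matches every <name> element at any depth, in document order.
all_names = [element.text for element in root.findall(".//name")]
print(all_names)

# find() only looks at direct children here and returns the first match (or None).
rows = []
for item in root.findall("item"):
    name = item.find("name")
    price = item.find("price")
    rows.append((item.get("id"), name.text, float(price.text)))
    print(item.get("id"), name.text, price.text)
```

Note that .text always comes back as a string (or None for an empty tag), so numeric values such as the price need an explicit conversion.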
Output:
json
More and more data are being structured in json format. That's why I feel that, just as with xml, working with these file structures should be one of the first things you learn in coding. JSON looks more akin to the key:value notation of a python dictionary, but it's still nested and structured like xml.
Let's look at some json. This example will be the first one listed at https://json.org/example.html
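The original listing is missing here as well, so the sketch below uses a small made-up document rather than the json.org example itself; every key and value in it is an assumption. It shows the usual first steps: parse with json.loads, pretty-print to see the nesting, then walk down through dictionaries and lists.

```python
import json

# Hypothetical stand-in document; keys and values are assumptions.
SAMPLE_JSON = """
{
  "library": {
    "name": "Example Library",
    "shelves": [
      {"id": "s1", "books": [{"title": "First Book", "pages": 120}]},
      {"id": "s2", "books": [{"title": "Second Book", "pages": 300},
                             {"title": "Third Book", "pages": 45}]}
    ]
  }
}
"""

# json.loads turns objects into dicts and arrays into lists.
data = json.loads(SAMPLE_JSON)

# Pretty-printing with an indent makes the nesting easy to read.
print(json.dumps(data, indent=2))

# Inspect the keys one level at a time to learn the layout.
top_keys = list(data.keys())
library_keys = list(data["library"].keys())
print(top_keys, library_keys)

# Drill down by chaining dictionary keys and list indexes.
second_shelf_first_book = data["library"]["shelves"][1]["books"][0]
print(second_shelf_first_book["title"], second_shelf_first_book["pages"])

# Collect one field from every nested record with a comprehension.
titles = [
    book["title"]
    for shelf in data["library"]["shelves"]
    for book in shelf["books"]
]
print(titles)
```

For json stored in a file, json.load takes an open file handle and returns the same kind of nested dictionaries and lists.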
Output:
Navigating structured data is essential when dealing with large datasets. Resources are obviously available if you go looking for help. My main complaint is that something this crucial is absent from the resources and tutorials that I've come across.