
Global-scale Image Storage with Order(s) of Magnitude Less Space

Film: Global-scale Image Storage with Order(s) of Magnitude Less Space
Wolfgang Richter, 15 February 2021, 7 minute read

Overview

What if we could store 10-100x more photos? How many more memories would we keep? How many new applications would we invent? Soroco's Film, a state-of-the-art image storage system, is a rethinking of the storage stack for modern computer vision, machine learning, and web-serving workloads. Out of the box it delivers a measured 12x storage saving on big-data image sets, with the potential for 100x with further optimization.

In 2017, Soroco began storing big data for our Scout Enterprise product. We realized very early on that we had to drastically reduce our disk footprint, or else either we or our customers would have to build a storage layer that rivals large public clouds like AWS, Azure, or Google Cloud! That's when we invented Film. Out of the box we achieve a 12x storage reduction, and we believe the optimizations we are currently researching can push that beyond 100x for our workload, achieving two orders of magnitude of on-disk storage savings.

Getting to this point required a rethink of the storage layer for image data in combination with the workloads we wanted it to serve. We essentially compress and deduplicate pixel data while maintaining the accuracy required by computer vision, machine learning, and other algorithmic workloads. While we use Film for screenshots, early tests show that it works well even for general photos sampled from data sets like ImageNet. Film rests on two key innovations:

1. Transcoding large (1,000+) groups of loosely related (temporally and spatially) images into large containers, and
2. Sorting those images by a perceptual hash to present them in an optimal order for pixel-level deduplication by a video codec.

Lossy Storage is OK

Soroco is interested in machine workloads. We don't have to show most of our images back to people, except in limited use cases. Thus, we only care about keeping their fidelity just good enough for computer vision and related workloads. Critically, we don't need pixel-by-pixel perfect image retrieval.

Write once randomly, read many (usually) sequentially

Our workload is write once, read many times. For our computer vision and machine learning applications, we can guarantee that data sets are read off disk in sequential order. People-facing workloads may require random retrieval, but these were limited in our use case. We also believe that, in general, image retrieval is typically not in random order, which is why caching layers often work so well: popular images are easy to cache, and when an individual user browses their albums, we can expect an almost sequential retrieval pattern. Thus, we do not need to store each image separately for quick random access. This means we can cut down heavily on wasted file system metadata and use a larger container for the images efficiently. Similar to a column-store database, Film collates many related images into large containers.

What we can't control are the write patterns. In our workloads, users are likely to upload many images at once or as a stream throughout the day, but there is definitely more randomness in write patterns across a population of users. We do have to handle fast, high-volume random insertions across our user population, which means we need a scalable approach to grouping images that doesn't slow down the write path.
Storing 12x More Pictures with Pixel Deduplication

Image similarity, in the form of duplicate pixels, stood out as the primary cause of wasted space on disk. Why were we storing so many duplicate pixels? We had a pathological case with our Scout Enterprise product: our image data was screenshots from across the enterprise. Desktops within one organization all look very similar. They often share the same desktop background, and enterprise users often run similar software. For example, many users may work in Microsoft Excel for many hours. All of their screenshots contain duplicate pixels, and individual users accumulate large amounts of duplicate pixels as they work on the same Excel file.

So the question becomes: how do we deduplicate the pixels in massive numbers of screenshots? We quickly realized that video codecs were a perfect fit for our problem, even though they weren't designed to solve it. Video codecs deduplicate pixels within frames and across multiple frames in a video stream, taking advantage of temporal and spatial locality. We chose the off-the-shelf video codec HEVC/H.265, although AV1, VP9, or any other suitable codec could be used as a drop-in replacement. Because its architecture is modular, Film automatically benefits from future advancements in video codec compression techniques.

We set the quality setting of the codec to a high value (we default to CRF=28). One optimization is to tune CRF based on the accuracy and recall of the algorithmic workloads running on top of our dataset. While we are OK with lossy storage, we still need Film to serve user-facing workloads such as web browser rendering, and images must remain legible for some of our Scout Enterprise features.

Figure 1. Storage container format. Many similar images are grouped together in containers and then transcoded into a single, large video stream on disk.

We break images up and transcode them typically 1,000 at a time. This is a tunable parameter and could really be anything; we felt 1,000 was a good trade-off between transcoding time, the number of unprocessed files we have to cache, and our servers' capacity (CPU, memory, disk). Increasing this value slows down transcoding and requires more CPU, while decreasing it lowers the maximum compression possible. Very large storage systems may want to increase this value to 100,000 or even 1,000,000+, especially for archival workloads.

Storing 100x More Pictures by Optimizing Image Order with Perceptual Hashing

There is a way to save even more across our users and their images. With images arriving in roughly chronological order for each user, a video codec will already find lots of duplicate pixels, for example when a user works in Excel for extended periods of…
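To make the two ideas above concrete, here is a minimal sketch, not Soroco's Film implementation, of packing one batch of screenshots into an HEVC container: sort the batch by perceptual hash so similar images become adjacent frames, then hand the ordered frames to ffmpeg with libx265 at CRF 28. It assumes Pillow, imagehash, and an ffmpeg build with libx265 are installed, and that the images share (even) dimensions; paths and parameters are illustrative.

```python
# Sketch: perceptual-hash ordering + HEVC transcode of one image batch.
import shutil
import subprocess
import tempfile
from pathlib import Path

from PIL import Image
import imagehash


def pack_batch(image_paths: list[Path], container: Path, crf: int = 28) -> None:
    # 1. Order the batch by perceptual hash so visually similar images end up
    #    next to each other, giving the codec more duplicate pixels to exploit.
    ordered = sorted(image_paths, key=lambda p: str(imagehash.phash(Image.open(p))))

    # 2. Lay the ordered images out as numbered frames and encode them as one
    #    lossy HEVC stream (CRF ~28, the default the article mentions).
    with tempfile.TemporaryDirectory() as tmp:
        for i, src in enumerate(ordered):
            shutil.copy(src, Path(tmp) / f"frame_{i:06d}.png")
        subprocess.run(
            [
                "ffmpeg", "-y",
                "-framerate", "1",
                "-i", str(Path(tmp) / "frame_%06d.png"),
                "-c:v", "libx265", "-crf", str(crf),
                "-pix_fmt", "yuv420p",   # keep the output widely decodable
                str(container),
            ],
            check=True,
        )


# Illustrative usage: pack ~1,000 screenshots into a single container file.
# pack_batch(sorted(Path("screenshots").glob("*.png")), Path("batch_0001.mkv"))
```

A real system would also need the index described in the article (which container holds which image, and at which frame) so that individual images can be decoded back out on demand.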

Why are we, at Soroco, inspired by astronomers – the OG computer hackers? [Part I]

Why are we, at Soroco, inspired by astronomers – the OG computer hackers? [Part I]
Rohan Murty, 26 February 2021, 4 minute read

Overview

As a company, the more we build and scale Scout, the more we see strong parallels to how astronomers collect data, build pipelines, and find patterns in astronomical data. Like astronomers, at Soroco we worry about data capture, cleanliness, noise, aggregation, clustering, and filtering. We deal with a similar scale of data capture on Scout, and we too focus on storage and indexing. And finally, Scout has several pipelines for extracting faint patterns from data, with many parallels to how astronomers find patterns in theirs.

In some cases, however, the challenges we see with Scout data exceed those astronomers face. For example, astronomers often have the luxury of searching for a pre-defined template or pattern, and the rate at which Scout gathers data far exceeds the rate at which astronomy's data gathering is growing (doubling roughly every two years). Hence, we look to astronomy to learn from the parallels while also being pushed to solve the problems specific to Scout.

Recently, we at Soroco began working with Shravan Hanasoge, a professor at the Tata Institute of Fundamental Research (TIFR), on computational questions tied to Scout, drawing inspiration from how Shravan and his team think about finding patterns to detect gravitational waves. This is not the first time we have collaborated with scientists and engineers working in astronomy; we even interview and recruit astronomers (apply here, if you're interested!). So when a couple of my younger colleagues asked why we interview astronomers, read their papers, and even collaborate with them, we figured this point of view might be of broader interest to the community as well. It is not a conventional point of view among software companies, and we believe it is fairly unique to how we, at Soroco, solve problems at scale, think about team composition, and value diverse talent. Our hope is that this article does justice to astronomers and that readers will come to share our enthusiasm for them.

Astronomy and computing at scale

Astronomy projects are literally about finding the needle in the haystack at scale. These projects generate a tremendous volume of data that must be stored, ingested, and analyzed to make new discoveries. Companies that build and manage platforms for collecting, ingesting, and analyzing large volumes of data are typically well funded and staffed by large teams of engineers, computer scientists performing R&D, product managers, QA teams, and SRE teams. On a relative basis, astronomy projects are built by small teams of astronomers and engineers and operate on budgets that are several orders of magnitude lower (even compared to a startup that has raised a Series B). And yet these teams build systems that demonstrate tremendous scale, collecting large volumes of data, storing, indexing, and processing them, and ultimately finding the signal in the noise. These systems survive and persist over extended periods. Long before the explosion of data in social networking companies, astronomy projects were often outpacing companies and the computing industry in their ability to build systems that process data at scale.
This article hopes to highlight the incredible engineering work done by small teams of engineers and astronomers. Here are some examples that illustrate the scale and complexity of astronomy projects; each produces large volumes of data and represents a significant software engineering and innovation effort.

LIGO: The Laser Interferometer Gravitational-Wave Observatory (LIGO), a much-vaunted project in the news, is dedicated to observing gravitational waves. LIGO generates about 1TB of data each night it operates. Future upgrades to LIGO, as well as new planned sites such as LIGO-India, will only increase these nightly data rates by several orders of magnitude.

SETI@Home: Now defunct, but one of the largest distributed applications ever built (and in many regards a precursor to the sharing economy), SETI@Home started in the late 90s and ran for 21 years. It was a distributed computing platform running on end users' desktops with the aim of analyzing radio signal data for possible signs of extraterrestrial life. Radio telescope data was aggregated on central servers, and client software running on end users' machines pulled data from the central repository and analyzed it locally. As early as 2008, SETI@Home was processing 100TB of data per year (in 2008 terms, the size of the entire US Library of Congress). At its peak, this distributed platform had a computing power of 668 teraflops across 5.2 million end users running the platform. The underlying technology, built by the Space Sciences Laboratory at Berkeley, was eventually open sourced as the BOINC platform, a platform for distributed computation that remains relevant today in applications ranging from climate science to mathematics to biology and medicine.

SDSS: The Sloan Digital Sky Survey is a project to construct a detailed 3D map of the universe.

LSST: The Large Synoptic Survey Telescope is a large telescope slated to operate at very high data rates, equipped with a 3.2-billion-pixel camera capable of recording the entire visible sky twice each week.

TMT: The Thirty Meter Telescope, an extremely large telescope (ELT) being built in Hawaii that will likely be the largest telescope ever built, is a multi-national project spanning research teams across the US, Japan, China, Canada, and India.

The figure below puts these projects, their scale, and their complexity in perspective.

Source: Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy
The ever-increasing data rates per night of various projects over the past two decades.

So, what does it take to store, index, and process data at such high data rates? What kinds of queries can one run on this sort of data, and what is the most efficient architecture for it? What pipelines are necessary to process the data, and how are they integrated? What is the data model…

Why are we, at Soroco, inspired by astronomers – the OG computer hackers? [Part II]

Why are we, at Soroco, inspired by astronomers – the OG computer hackers? [Part II]
Abdul Qadir, 4 June 2021, 15 minute read

Diving deeper into ZTF

Of all the projects we have come across in astronomy, we see the strongest parallel between the Zwicky Transient Facility (ZTF) and Scout. ZTF is basically Scout for the night sky; or Scout is ZTF for the enterprise. Both systems span multiple areas of computing and, at heart, solve a similar problem: how do you find faint patterns in noisy observational data at scale?

ZTF is an automated system of telescopes at Palomar/Caltech that finds transients (such as gamma-ray bursts, comets, etc.) and generates ~4TB per night (assuming 100 observational nights in a year, that is about 400TB per year). ZTF consists of a base platform that collects, cleans, and stores the data. The data is then processed through a series of successive pipelines that refine it and find patterns. The processed data, rich with possibilities, is then extended to address multiple astroinformatics questions.

At its heart, ZTF is meant to find new patterns and compare them against previously known discoveries to ascertain the validity of a newly found pattern. Conceptually, this is an example of what ZTF does:

Source: ZTF

Once a pattern has been discovered, ZTF classifies the new pattern, or 'alert', into bins such as "variable star", false detection, and so on. Here is a snapshot of how ZTF classifies light curves, or observations. Think of a light curve as a particular hash or signature of an astronomical phenomenon. Here is an example of a light curve.

Source: The ZTF Source Classification Project: I. Methods and Infrastructure

These light curves are classified using a combination of machine learning and deep learning. Here is a schematic of how ZTF classifies light curves.

Source: The ZTF Source Classification Project: I. Methods and Infrastructure

Classification uses supervised learning algorithms, setting the problem up as an optimization that minimizes the gap between a prediction and the ground-truth observation. But why use learning algorithms here at all? Besides its large volume, light curve data tends to be unevenly sampled, incomplete, and affected by biases (presumably from the equipment), so standard time series analysis may prove insufficient. This is where learning algorithms tend to do quite well; a whole body of prior work has demonstrated that they perform well on this class of problems.

Once a pattern is classified, ZTF can run several different pipelines to further validate the specific bin the event has been classified into. For example, DeepStreaks is a component in the ZTF pipeline used to identify streaking near-Earth objects (NEOs), such as comets. Here is a high-level decision tree and sample results showing how DeepStreaks decides whether a candidate pattern is a plausible NEO, a non-NEO event, or noise.

Source: Matthew Graham, ZTF & Caltech

Finally, all of this adds up to Tails, the world's first deep-learning-based framework to assist in the discovery of comets. Tails is built on top of the base data-gathering platform.

Source: Tails: Chasing Comets with the Zwicky Transient Facility and Deep Learning, Dmitry A. Duev, NeurIPS 2020
The Tails architecture, which employs an EfficientDet D0-based network.

Tails has been online since August 2020 and produces between 10 and 20 NEO candidates each night.
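To make the supervised-learning framing described above concrete, here is a toy sketch. It is not ZTF's actual pipeline: the light-curve features, labels, and model choice are invented purely for illustration of "minimize the gap between prediction and ground truth" as a classification problem.

```python
# Toy illustration of supervised classification of light-curve features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend features per light curve: period, amplitude, scatter (synthetic).
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 3, size=1000)   # 0=variable star, 1=transient, 2=false detection

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# The "gap" is measured here as held-out accuracy; on random data it hovers
# around chance (~0.33). Real pipelines evaluate on labelled survey data.
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```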
Let us examine the achievement of this particular project in a historical context. Since the first Homo sapiens, people have looked up at the night sky and wondered about our place in this universe. That very act has been a source of inspiration for religion, art, science, literature, and pretty much everything humankind has done. Cave art from 40,000 years ago reveals that the ancients tracked astronomical phenomena such as comet strikes and planetary shifts. What we see today with ZTF is an example of how this very old pursuit of humankind has now largely been automated with advances in contemporary computing.

Fritz software platform

All of these advances have culminated in the ZTF team open sourcing their underlying extensible data platform, Fritz. In many regards, Fritz and the entire ZTF effort echo the architecture, thinking, and design behind how we at Soroco are building the Scout platform.

The point is that, just through the lens of ZTF, we can see the incredible range of expertise the ZTF team of astronomers and engineers has had to develop to do its scientific work: signal processing, computer vision, deep learning, machine learning, clustering algorithms, infrastructure, storage, databases, API design, parallel processing, networking, and operating systems, with architecture, system design, and system integration on top of all that. Whew! That is literally an entire undergraduate computer science curriculum's worth of skillsets rolled into one team!

Think about it: when was the last time you knew of a software product or project built by a small team that spanned so many different areas of computing? At Soroco, whenever we are confronted with technical challenges, we remind ourselves of what these ninja teams in astronomy do; that humbles us and spurs us on.

If you enjoyed reading this article and want to work on similar problems, apply here and come work with us!

Reflections

Some computer science purists may argue that a lot of this is about applying technology rather than building 'new' technology. But we view these distinctions as irrelevant barriers. What astronomers have shown us, time and time again, is a focus on achieving the end outcome using computation and solving any and every problem that comes their way. It is precisely this confluence of different skills, technologies spanning the stack, and collaboration across physics and computer science that births new systems and advances the capabilities of any software system. In several cases, these teams may have applied existing algorithms and technologies, but they have had to figure out how to integrate disparate components, which components to pick, and how to handle scale, performance, latency, and accuracy. In some instances, they have had to solve hard computing problems on their own, without waiting for computer scientists to solve and publish them.

Therefore, astronomers have had no choice but to mature into excellent computer scientists and engineers themselves. They have had to design, engineer, and solve their way to actually…

Abstract Syntax Tree for Patching Code and Assessing Code Quality

Abstract Syntax Tree for Patching Code and Assessing Code Quality
Abdul Qadir, 4 June 2021, 15 minute read

Why should you care?

How do we easily and scalably patch hundreds of thousands of lines of source code? Read on to see how we used a simple yet powerful data structure, the Abstract Syntax Tree (AST), to build a system that, from one central point, maps source code dependencies and in turn patches all of them.

Abstract

A software system is usually built with assumptions about how its dependencies, such as the underlying language runtime, frameworks, and libraries, behave. Changes in these dependencies may ripple into the software system itself. For example, the popular Python package pandas recently released its 1.0.0 version, which deprecated and changed several functionalities that existed in the previous 0.25.x versions. An organization may have many systems using pandas 0.25.x, so upgrading to 1.0.0 requires the developers of every system to go through the pandas change documentation and patch their code accordingly.

Since we developers love to automate tedious tasks, it is natural to think of writing a patch script that updates the source code of all the systems according to the changes in the new pandas version. A patch script could parse the source code and do some kind of find-and-replace, but such a script would likely be unreliable and incomplete. For example, say the patch script needs to change the name of a function from get to create wherever it is called in the code base. A simple find-and-replace will end up replacing the word "get" even where it is not a function call. Similarly, find-and-replace cannot handle cases where code statements spill over multiple lines. We need the patch script to parse the source code while understanding the language constructs. In this article, we propose the use of Abstract Syntax Trees (ASTs) to write such patch scripts, and then we show how ASTs can be used to assess code quality.

Abstract Syntax Tree (AST)

An Abstract Syntax Tree (AST) is a tree representation of source code (see the Wikipedia page). Almost every language has a way to generate an AST from its code. We use Python to build several critical parts of our systems, so this article uses Python for its examples, but the lessons apply to any other language.

Python ships with a package called ast for generating ASTs. Here is a small tutorial (a sketch of the code appears below), using the toy program consisting of var = 1 followed by print(var). The head of the AST is a Module object, which makes sense. Let's dig deeper into it. The ast package provides an ast.dump(node) function that returns a formatted view of the entire tree rooted at node. Calling it on the head object shows that head, of type Module, has an attribute body whose value is a list of two nodes: one representing var = 1 and the other representing print(var). The first node, representing var = 1, has a target attribute for the LHS var and a value attribute for the RHS 1. Printing the RHS works as expected. We can then modify the RHS from the value 1 to 2, and the dump confirms that the corresponding attribute now carries the value 2. Now we want to convert the AST back into code to obtain the modified source.
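The listing below is a hedged reconstruction of the tutorial steps just described (parse, dump, inspect, modify), not the post's original code blocks. It uses only the standard-library ast module and assumes Python 3.8+, where literals are represented as ast.Constant nodes.

```python
import ast

source = "var = 1\nprint(var)"

# Parse the source into an AST; the root is an ast.Module object.
head = ast.parse(source)
print(type(head))            # <class 'ast.Module'>

# ast.dump(node) returns a formatted view of the whole tree rooted at `node`.
print(ast.dump(head))

# head.body is a list of two statement nodes: `var = 1` and `print(var)`.
assign = head.body[0]        # the `var = 1` node
print(assign.value.value)    # the RHS constant -> 1

# Modify the RHS from 1 to 2 directly on the tree.
assign.value.value = 2
print(ast.dump(head))        # the Constant node now carries value=2
```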
To convert the tree back into source code, we use a Python package called astunparse, since (before Python 3.9) ast does not provide this functionality. The unparsed code contains the statement var = 2 instead of var = 1, as expected.

IntelliPatch

Now that we understand ASTs and how to generate, inspect, and modify them and re-create code from them, let's return to the problem of writing patch scripts that move a system's code from pandas 0.25.x to pandas 1.0.0. We call these AST-based patch scripts "IntelliPatch". All the backward incompatibilities in pandas 1.0.0 are listed on this page. Let's take the first backward incompatibility on the list and write an IntelliPatch for it.

Avoid using names from MultiIndex.levels

In pandas 1.0.0, the name of a MultiIndex level can no longer be updated with the = operator; it requires Index.set_names() instead. Code that works with pandas 0.25.x raises a RuntimeError with pandas 1.0.0 and must be rewritten to the equivalent form using set_names(). The IntelliPatch needs to do the following:

1. Create the AST of the given code and traverse it.
2. Identify any node that represents code of the form <var>.levels[<idx>].name = <val>.
3. Replace the identified node with one that represents code of the form <var> = <var>.set_names(<val>, level=<idx>).

Our script, intelli_patch.py, does exactly this; a sketch of its core transformation appears at the end of this excerpt. Note that a code statement to be replaced may span more than one line and may be nested, for example inside a function g that sits inside a function f that sits inside a class C; because the script works on the AST rather than on raw text, these cases are handled as well.

One can extend the patch script to cover all the backward incompatibilities in pandas 1.0.0, and then write an outer function that walks through every Python file of a system, reads its code, patches it, and writes it back to disk. It is important that a developer reviews the changes made by IntelliPatch before committing them; for example, if the code is hosted in git, a git diff should be performed and reviewed.

Impact

At Soroco, we have written five IntelliPatch scripts so far and run them on 10 systems. Each script successfully parsed and patched about 150,000 lines of code across those 10 systems. In terms of productivity, this effort took one of our engineers three full…
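Below is a minimal sketch of the transformation described above. It is not Soroco's intelli_patch.py: it targets Python 3.9+ (so it can use ast.unparse instead of the astunparse package the post relies on) and handles only the simple case where <var> is a bare name; extending the match to attribute chains like df.index follows the same pattern.

```python
import ast


class LevelsNamePatcher(ast.NodeTransformer):
    """Rewrite `<var>.levels[<idx>].name = <val>` into
    `<var> = <var>.set_names(<val>, level=<idx>)`."""

    def visit_Assign(self, node: ast.Assign) -> ast.AST:
        self.generic_visit(node)          # also walks nested classes/functions
        if len(node.targets) != 1:
            return node
        target = node.targets[0]
        # Match the LHS shape: <name>.levels[<idx>].name
        if not (isinstance(target, ast.Attribute) and target.attr == "name"):
            return node
        sub = target.value
        if not isinstance(sub, ast.Subscript):
            return node
        levels = sub.value
        if not (isinstance(levels, ast.Attribute) and levels.attr == "levels"
                and isinstance(levels.value, ast.Name)):
            return node
        var = levels.value.id
        replacement = ast.Assign(
            targets=[ast.Name(id=var, ctx=ast.Store())],
            value=ast.Call(
                func=ast.Attribute(value=ast.Name(id=var, ctx=ast.Load()),
                                   attr="set_names", ctx=ast.Load()),
                args=[node.value],
                keywords=[ast.keyword(arg="level", value=sub.slice)],
            ),
        )
        return ast.copy_location(replacement, node)


def patch_source(source: str) -> str:
    tree = LevelsNamePatcher().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(patch_source("mi.levels[0].name = 'species'"))
# -> mi = mi.set_names('species', level=0)
```

Because NodeTransformer walks the whole tree, the same match fires whether the assignment sits at module level or deep inside nested functions and classes, which is what makes the AST approach more reliable than textual find-and-replace.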

Building Large Scale Systems and Products with Python

Building Large Scale Systems and Products with Python
George Nychis, 15 April 2021, 20 minute read

Overview

At the beginning of Soroco's journey, we had to answer a question that many engineering organizations have had to answer before: what programming language would we use to build and scale our products? Each organization needs to answer this question on its own because every product's goals, needs, and constraints are different. However, even with our own goals in mind (which we will explain), no language we could pick would be perfect. We wanted to make the decision knowing each language's potential and shortcomings, and to plan to overcome the key shortcomings as we built our technology.

Here are the typical scenarios we encounter, and the challenges we face, when automating or discovering transactions in a live enterprise environment:

- The automated or discovered work needs to closely match what teams were already doing on the ground: use the same applications, the same data, and most often follow the same steps. A transaction in this context is therefore determined by the steps taken by the teams that execute the work manually, and the right comparator for scale and performance is the manual work those teams do today. Consequently, this almost always means dealing with legacy (including mainframes!) and highly varied enterprise applications, up to 80% of which typically have no API.
- Each transaction typically involves accessing approximately 7,500 data fields across 71 screens, executing 216 steps, context switching between enterprise applications 15-890 times, and taking anywhere between 5 and 20 minutes to execute.
- Data is pulled from multiple heterogeneous enterprise applications: on average, each instance involves gathering data from 5-20+ applications, of which 40% tend to be legacy.
- Reading a diverse set of complex documents (e.g. invoices, legal documents, etc.) requires complex NLP processing to extract structure from documents, as well as to compare, in near real time, the semantic similarity of multiple documents. On average, each process involves reading 15 different documents.
- Each automated transaction needs to have the same fidelity as humans, if not better, in terms of error rates, throughput, and reliability, while being more scalable.
- There is extremely high diversity in the set of processes, their steps, and the industries in which they are executed. For example, in this post alone, the data is based on 7 different industries and nearly 20 different functions.

Hence, nearly 7 years ago, when we set out to finalize our decision on a programming language, we were designing and developing our automation and process discovery products. Our automation product had to be capable of handling billions of transactions a year for a single business process. Our process discovery product would need to process billions of data points to discover millions of processes. Both would be distributed systems deployed globally, running on top of a live enterprise and subject to all of the challenges listed above.

In 2020, Soroco achieved the scale we planned for when making these decisions. Within the past 12 months, Soroco's Scout product has discovered over 1.3 million process transactions covering about 12 million hours of manual work.
In 2020, Soroco’s automation systems have processed over 1.2 billion enterprise transactions across multiple clients to bring our customers savings and scale to the extent of over 2M hours. Note, however, most of these automated systems ran in sync with people’s working timings and on working days. This is because typically the automation execution is triggered by an incoming email, document, or an event that populates data in an enterprise system. Furthermore, our ability to ‘scale’ more transactions per second is significantly rate-limited by the delays and slowness of legacy enterprise apps that are not built for an automated layer of software running on top. Therefore, our point is not about merely optimizing for number of transactions per second. There are many systems where Python has been optimized for this metric alone. We cannot control for this metric in an enterprise automation setting built on top of legacy systems. Rather, our point is about ensuring high-fidelity and scalable execution of automation systems in the enterprise while also meeting enterprise standards of safety and reliability. Therefore, we needed to be able to architect and design our technology carefully. Though we think of picking a programming language to meet this kind of scale as a technical decision, it is important to keep in mind that scaling technology also means being able to scale the engineering team who builds it. The easier the product is to develop, and its code is to read, deploy, secure and maintain…then the better the technology’s development could scale. In this blog post, we will describe why Soroco chose Python and what we did to ensure we could develop reliably, at scale, and securely. Many of these properties were not ‘out of the box’ with Python 7 years ago. This was at a time when it was far from the most popular language, still considered ‘slow’ and a ‘scripting language.’ Python was far from being considered a language for building large scale systems. All of that has changed today, and in this blog post we will provide guidance in all of the following dimensions which helped us build products with Python. Predicted Growth of Python: Why we picked Python to make it easier to scale our engineering team, despite many of its limitations. The global education system provided hints that Python would be one of the most widely used and known languages in a few years from when we started. PEP484 and Enforcing Typing: How we overcame the downsides of being non-statically typed (e.g., more potential errors in runtime) by supporting the growth of Python’s PEP484 for ‘gradual typing’ while it was still in development. Developing an early PyCharm plugin that enforced it (before mypy was… Continue reading Building Large Scale Systems and Products with Python

Modernizing Object Storage for Cloud Native Deployments

Modernizing Object Storage for Cloud Native Deployments
George Nychis, Vageesh Hoskere & Wolfgang Richter, 3 August 2021, 15 minute read

Why should you care?

Data storage is a universal need. Structured data goes into familiar stores like an RDBMS (PostgreSQL, MySQL, Oracle), but unstructured data can be housed in many ways: object storage systems, key-value caches, document stores (if there's some structure), and even flat files on a file system. This article details how and why your choice of unstructured storage:

- affects your scalability by making or breaking your cloud native capability,
- balloons your software maintenance cost, and
- limits the possible savings you could get on your cloud storage expenses.

We take you on our journey from a home-grown, flat-file-based object storage layer to an off-the-shelf approach with MinIO, which saved us 30% in storage costs and 90% in maintenance costs.

Object Storage at Scale

Soroco's product Scout collects millions of data points every day from interactions between teams and business applications during the natural course of a workday. From the collected interactions, Scout detects patterns in the data using various machine learning algorithms to help our customers find opportunities for operational improvement. Below, we show the flow of this information via an example using Scout events.

Scout events are represented as JSON objects that are buffered in memory and then periodically stored in compressed, encrypted JSON files on disk. Compression minimizes the network bandwidth and storage requirements. Encryption protects sensitive data at rest and in flight. Buffering saves compute resources by batch-processing events. These services can be in the cloud, or on premises if the customer prefers it. Scout's data ingestion services then decrypt and decompress the JSON data, after which the individual records are post-processed and stored in an RDBMS. The records can then be fetched by our various machine learning algorithms.

A sketch of information flow in Scout

We must store the original data, though, because post-processing might transform or accidentally drop data that we find useful in the future. For example, an updated machine learning algorithm might want a re-interpretation of the features from the original samples; if we threw them away after post-processing, we could never go back to the original data to improve results. Of course, we also have to store screenshots somewhere, and our RDBMS did not seem like a good choice. A contributor to PostgreSQL benchmarked the performance of object storage in PostgreSQL against plain disk storage and found a 10x slowdown in a read-based benchmark. You don't want to store objects in PostgreSQL!

A typical large-scale deployment spanning hundreds of teams and thousands of users ingests approximately 2 billion objects, equating to over 130 TB per year (assuming 261 working days). The post-processed structured information stored in our RDBMS is orders of magnitude smaller because it is just the output of a feature engineering pipeline for machine learning algorithms.
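Here is a hedged sketch of the compress-then-encrypt flow described above for Scout event batches. It is not Scout's actual on-disk format: zlib and a symmetric Fernet key are chosen only for illustration, the event fields are invented, and key management is out of scope.

```python
import json
import zlib

from cryptography.fernet import Fernet

key = Fernet.generate_key()     # in practice this would come from a secrets store
fernet = Fernet(key)

events = [{"app": "excel.exe", "event": "click", "ts": 1628000000 + i} for i in range(3)]

# Producer side: serialize -> compress -> encrypt -> write.
blob = fernet.encrypt(zlib.compress(json.dumps(events).encode("utf-8")))
with open("events.batch", "wb") as f:
    f.write(blob)

# Ingestion side: read -> decrypt -> decompress -> parse.
with open("events.batch", "rb") as f:
    restored = json.loads(zlib.decompress(fernet.decrypt(f.read())))

assert restored == events
```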
In addition to the storage volume itself, the full set of requirements we came up with when looking for an object storage solution was:

- Handle our storage requirements of objects at scale
- Decouple storage from our local file system to reduce cost and maintenance
- Provide compression to reduce storage requirements
- Require minimal maintenance from our engineering team
- Support detailed access control lists to protect the original data files
- Integrate simply with cloud native storage services such as Amazon S3 and Azure Blob Storage
- Support local storage if cloud native storage services are not available (e.g., on premises)

A solution meeting all these requirements ought to be both cloud-native and scalable. This would let our product handle substantial retention periods (1 year or more), on-demand random-access read workloads, and all of the deployment scenarios we care about (bare metal, private cloud, public cloud). In the remainder of this article, we present the different approaches and trade-offs which led to our final solution, which saved us 30% in storage costs and 90% in maintenance costs.

Considering our Options for Object Storage

There are a few common options for object storage that we considered while evaluating different designs to meet our requirements.

Filesystem-based Object Storage with References

A low-complexity solution to object storage is to store objects on disk and keep references to the available objects, along with any important metadata, in a database or index. Git is well known for doing this, implementing a style of it called content-addressable storage (CAS). An example is illustrated below: in a CAS system, objects are stored on the filesystem by their hash, and any metadata associated with them is stored in a database or catalog.

The benefits of file-based object storage are simplicity of design and, if CAS is used, de-duplication of objects for more efficient storage, since multiple references can map to the same object on disk. No specialized systems are required to track the objects, and accessing them is as easy as filesystem reads. The downsides are maintenance, the lack of access control without building or adopting a more substantial system around it, and the lack of a shared filesystem in modern cloud native deployments, where services do not assume local storage. Though you could mount a network share, the performance impact of using an NFS share would likely be substantial. For these reasons, we believe that while this approach is fast and simple, it does not meet many of our requirements.

Distributed Object Storage

To keep the benefits of filesystem-based object storage while overcoming the limitations around access to the storage, distributed object storage systems such as Ceph and Swift were built. Their design is illustrated below: a "storage cluster" is built by distributing objects across any number of block devices (e.g., bare-metal disks). This storage is then made accessible through microservices with network-accessible APIs for storing and retrieving objects, with fine-grained access control.

An example distributed object storage deployment with Ceph (Credit: https://insujang.github.io/2020-08-30/introduction-to-ceph/)

Benefits of distributed object…
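The requirement list above asks for simple integration with S3-compatible services, and the journey described in this article ends with MinIO. The sketch below shows what that integration can look like; it assumes a MinIO server at localhost:9000 with placeholder credentials and is not necessarily how Scout integrates. Because MinIO speaks the S3 API, the same boto3 client code works against MinIO on premises, Amazon S3, or any other S3-compatible store.

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",    # drop this line to target Amazon S3
    aws_access_key_id="minioadmin",           # placeholder credentials
    aws_secret_access_key="minioadmin",
)

bucket = "scout-objects"
s3.create_bucket(Bucket=bucket)               # error handling/idempotence omitted

# Store and retrieve an object; compression and encryption would wrap the
# payload exactly as in the event-batch sketch earlier in this article.
s3.put_object(Bucket=bucket, Key="2021/08/03/events.batch", Body=b"...payload...")
payload = s3.get_object(Bucket=bucket, Key="2021/08/03/events.batch")["Body"].read()
```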

Increasing the Accuracy of Textual Data Analysis on a Corpus of 2,000,000,000 Words

Increasing the Accuracy of Textual Data Analysis on a Corpus of 2,000,000,000 Words
Michael Lee, 4 December 2021, 14 minute read

Introduction

Natural language processing (NLP) is one of the most active subfields of machine learning today. Text classification, topic modeling, and sentiment analysis have become vital techniques in a myriad of real-world applications, such as search engine optimization and content recommendation. Emails, social media posts, news articles, and other documents are constantly mined for insights into human opinions and behaviors by scientists, companies large and small, and even state actors.

At Soroco, natural language processing and machine-learning-based classification of text are foundational to many of our products. In some instances, we may ingest between 200,000,000 and 2,000,000,000 words over the course of model training and analysis for a single team using our Scout product. In this blog post, we will share some tips and tricks which we have found to significantly increase the accuracy of our models, including appropriate processing of text for the purpose of leveraging standard machine learning techniques. Many advanced methods for text classification require careful modifications to respect the structure of multi-field textual data for optimal performance. To illustrate the benefits of these techniques, this two-part series of blog articles will demonstrate the following ideas:

- We will show how to represent text in a high-dimensional vector space, with applications to a toy regression problem.
- We will detail how to perform multi-field textual data analysis using more sophisticated neural network technologies.

Challenges of Text Analysis

Text analysis is complicated by the nuances inherent in each application: real-world problems in NLP require analyzing complicated bodies of text, such as ones split into multiple fields or containing many words of natural language. Text fields in an email may include the subject, the sender and recipient, the email body, and the contents of any attachments. To illustrate the challenges and important aspects of modern text classification, we have included a sample email in which Soroco's marketing team announced NelsonHall naming Soroco a leader in Task Mining. This email, like any other, has multiple text fields. Based on the text of the email body, its subject line, and the identities of its senders and recipients, we may wish to perform some kind of classification task. For example, we may want to classify the email as a "Positive Announcement," or detect that it is related to marketing. But how do we properly set up a model to classify this email correctly and efficiently? Here are some potential challenges we might run into as we train a model to identify "Positive Announcement" emails:

- We might be tempted to collapse the entire text into a single block for the classifier. However, that would weight all fields with equal importance, which is not what we want. For this task in particular, the presence of positive words such as "Congratulations" in the email's subject might be more pertinent than the content of the email body.
- What if all emails passed to the classifier in training opened with "Congratulations to the Soroco Team", but a future email began with "Kudos to the Soroco Team"?
The input to the model during training and classification (e.g., how words are vectorized) can have a significant impact on the accuracy of the classification.

Once we have vectorized the text fields appropriately, there are many different classifier types we might use to solve the classification problem. However, not all of them will necessarily perform well with our chosen vectorization method. Some classifiers may work better with fastText or word2vec embeddings (which produce dense, fixed-size vectors), whereas others might work better with tf–idf mappings (which are sparse, and whose dimension grows with the size of the vocabulary in your corpus). Some experimentation may be required to find the model architecture that performs best for our input.

In Part 1 of this blog article, we first address the importance of how words are vectorized and show the implications this has for the model's accuracy and performance. In the second article, we will further illustrate the importance of training on multiple fields and its impact on the model.

Building a Model with Word Embeddings

The approaches we cover here rely on the concept of a latent embedding of textual data into a vector space, such as through word embeddings. Although there are simpler statistical methods for basic NLP, such as tf–idf, for our purposes we prefer a method that assigns some notion of semantic significance to our data. This semantic significance is the key to making the model resilient to the many ways in which the same contextual meaning can be written (e.g., "congratulations" versus "kudos" in our example above). Word embeddings provide a vector-space representation of each word in a vocabulary such that words which appear in similar contexts (such as synonyms) have representations which are close together in space. This helps us be robust against the variation in word choice we see in real-world applications, where we are analyzing textual data produced by people.

[Diagram from fasttext.cc]

The diagram above, from Facebook Research, shows the difference between two strategies for optimizing word embeddings. In the CBOW ("continuous bag of words") approach, we train a network to predict the embedding of a word from the sum of the embeddings of all words within a fixed-size window around that word. In the sample sentence, "I am selling these fine leather jackets," the embedding of the word "fine" is predicted from the sum of the embeddings of the words "selling," "these," "leather," and "jackets." In contrast, in the skipgram approach, we train the network to predict the embedding of a word from a single word selected at random from that fixed-size window, so the embedding of the word "fine" is predicted from the embedding…
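The excerpt breaks off here, but the CBOW/skipgram distinction it describes is easy to see in code. The following is a minimal sketch, not taken from the original post, using gensim's Word2Vec, which exposes the same two training strategies through its sg flag; the toy sentences and parameter values are placeholders, and fastText's own Python bindings could be used in the same way.

```python
# Minimal sketch (illustrative only): training CBOW vs. skipgram embeddings
# with gensim's Word2Vec on a toy corpus.
from gensim.models import Word2Vec

# Placeholder corpus: in practice this would be tokenized sentences drawn
# from the 200M-2B word corpus described in the post.
sentences = [
    ["congratulations", "to", "the", "soroco", "team"],
    ["kudos", "to", "the", "soroco", "team"],
    ["i", "am", "selling", "these", "fine", "leather", "jackets"],
]

# sg=0 selects CBOW (predict a word from the summed embeddings of its
# context window); sg=1 selects skipgram (predict from a single context word).
cbow = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=1)

# With a realistically sized corpus, near-synonyms such as "kudos" end up
# close to "congratulations" in the embedding space.
print(skipgram.wv.most_similar("congratulations", topn=3))
```

On a corpus of this toy size the neighbors are meaningless, but with real training data the nearest-neighbor query above is exactly the robustness to word choice the post is after.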

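One further note on the challenge list in the excerpt above: it warns against collapsing every e-mail field into a single block of text, because that weights all fields equally. A common way to keep fields separate is to give each one its own vectorizer and let a downstream classifier learn per-field weights. The sketch below shows this pattern with scikit-learn's ColumnTransformer; the field names, toy data, and the choice of TF-IDF plus logistic regression are illustrative assumptions, not Soroco's production pipeline.

```python
# Minimal sketch (illustrative only): vectorize e-mail fields separately so
# the classifier can weight them differently, instead of one collapsed block.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data with one column per text field.
emails = pd.DataFrame({
    "subject": ["Congratulations to the Soroco Team", "Weekly status report"],
    "body": ["NelsonHall has named Soroco a leader in Task Mining.",
             "Please find attached the usual metrics for this week."],
})
labels = ["positive_announcement", "other"]

# Each field gets its own TF-IDF vocabulary and its own feature columns.
fields = ColumnTransformer([
    ("subject", TfidfVectorizer(), "subject"),
    ("body", TfidfVectorizer(), "body"),
])

model = Pipeline([("vectorize", fields), ("classify", LogisticRegression())])
model.fit(emails, labels)

print(model.predict(pd.DataFrame({
    "subject": ["Kudos to the Soroco Team"],
    "body": ["Another analyst report recognizes our work."],
})))
```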
Rapid Response to the Log4j CVE & Important System Design to Limit Potential Impact

Article By George Nychis; Response By Engineering, SRE, and Customer Success

Introduction

Though many of our readers will already be aware of this topic and its severity, we want to share how Soroco's engineering team responded to the Log4j vulnerability. In this post we describe how we learned of the vulnerability, responded to it within 24 hours, and patched it across our production environments. We also describe how our system architecture, through its use of container technology, ensured that any impact of the vulnerability would have been isolated. After significant analysis of our products' usage, we found no evidence of any impact from the Log4j vulnerability.

Background

On December 9th, 2021, the Log4j vulnerability was announced and recorded as CVE-2021-44228, with the highest possible CVSS score of 10. It allowed remote code execution, which meant that an attacker could potentially execute a series of commands on a server where information was processed with the Log4j library. Those commands could, for example, read or modify information on the server, delete data, or attempt to extract sensitive information from it. If the reader wants a technical explanation of the vulnerability, we recommend the following post.

What made this vulnerability so severe, and the reaction to it so substantial, was that it was a "zero-day" in a library used widely across the industry. Zero-day refers to attackers being aware of the vulnerability before, or at the same time as, security researchers, and before a patch was available. Its wide use across the industry included major technology companies and services such as Apple, Google, Microsoft, Cloudflare, and Twitter. Additionally, around 32% of the Fortune 500 is reported to use Elasticsearch, which leverages Log4j, and this likely contributed to the scale of the response.

How Soroco's System Architecture Minimized Any Potential Impact

Soroco's use of Log4j was not direct, but rather indirect through our use of Elasticsearch and the ELK stack, like many others in the industry as described above. This meant that our response was to patch a third-party technology's use of the Log4j library rather than any direct use of Log4j in our own products.

What was most beneficial in minimizing the potential impact of the vulnerability was Soroco's practice of isolating its software systems and services with software containers. Soroco deploys the ELK stack through software containers, which run individual software systems in isolation by packaging each system together with all of the dependencies needed to run it. Aside from the simplicity containers provide in running software, their isolation properties also provide security benefits, since what runs inside a container is not (by default) given access to software systems running in other containers. Even if remote code had been executed using the Log4j vulnerability (and we have found no evidence of this at Soroco), it could not have accessed anything beyond what was in the ELK container. At Soroco, this deployment model meant that the vulnerable ELK stack did not have access to other parts of our infrastructure, such as a production database.

Responding in 24 Hours and Patching the Product within 48 Hours

As soon as we learned of the vulnerability and understood how our products were exposed to it, we began proactively notifying our customers, acknowledging the published vulnerability and describing our expected response to it.
This was received positively by our customers: many were still trying to get a response from other vendors when they received a proactive acknowledgement from us. After notification, we began patching.

Because we deploy the ELK stack through Elastic's officially published container images, patching the vulnerability was simple and required no substantial changes to Soroco's product technology. We only needed to ensure our products could tolerate short periods of ELK downtime while we patched. We made two changes, both confirmed directly by Elastic to remove the vulnerable functionality, to patch out the vulnerable third-party functionality in our production environments. The first change was to the start-up configuration of the Elasticsearch container image, adding the -Dlog4j2.formatMsgNoLookups=true option, which disables the vulnerable lookup functionality in Elasticsearch. Second, rather than wait for an official fix to Logstash, we modified Elastic's published Logstash container to remove the vulnerable JNDI functionality so that we could react quickly (a generic sketch of this kind of jar-level mitigation appears at the end of this post). Once these two changes were made, we immediately began rolling the updated containers out across our production environments. And again, our use of container technology helped isolate the vulnerable component from major portions of our technology stack and our production databases.

If you enjoy reading this article and want to work on similar problems, apply here and come work with us!

Learning from Soroco's Response

There are several lessons from the announcement of the Log4j vulnerability and our response to it that we have stressed to our teams globally. First, we worked with our customer success team to proactively acknowledge the Log4j CVE to all of our customers before they wrote to us looking for a response. This demonstrated Soroco's dedication to product security and assured our customers that we continually monitor for announced vulnerabilities. Second, Soroco's engineering and site reliability engineering teams immediately found the known workarounds and patched the product within 48 hours. This showed our ability to patch our product quickly and, in particular, how the use of modern container technology made it simple. Lastly, Soroco's use of container technology limited the potential exposure from vulnerabilities by isolating the information and systems that each service can access. Together, these ensured that our product and its use remained safe for our customers.
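For readers curious what the Logstash-side mitigation looks like mechanically, the widely published workaround at the time was to remove the JndiLookup class from the log4j-core jar inside the image. The sketch below illustrates that idea in Python; the jar path is a placeholder, and this is a generic example rather than Soroco's actual patching tooling.

```python
# Illustrative sketch: rebuild a log4j-core jar without JndiLookup.class,
# the widely published workaround for CVE-2021-44228. Not Soroco's tooling.
import os
import zipfile

JAR_PATH = "/tmp/log4j-core-2.14.1.jar"  # placeholder path
VULNERABLE_ENTRY = "org/apache/logging/log4j/core/lookup/JndiLookup.class"

def strip_jndi_lookup(jar_path: str) -> None:
    """Copy every jar entry except JndiLookup.class into a new jar, then swap it in."""
    patched_path = jar_path + ".patched"
    with zipfile.ZipFile(jar_path) as src, \
         zipfile.ZipFile(patched_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for entry in src.infolist():
            if entry.filename == VULNERABLE_ENTRY:
                continue  # drop the class that enables JNDI lookups
            dst.writestr(entry, src.read(entry.filename))
    os.replace(patched_path, jar_path)

if __name__ == "__main__":
    strip_jndi_lookup(JAR_PATH)
```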