Index Bug Haunts a Tech Company’s Search Engine Project

A mysterious bug has plagued a major tech company’s search engine project since February, causing index construction to fail at random. The issue lies in the code that merges partial indices during index building.

“The search engine constructs the reverse index through successive merging of smaller indices to reduce memory requirements,” explained lead engineer Jane Doe. “Suddenly, the code that merges these indices started failing randomly.”

The Bug’s Impact on Index Construction

The search engine operates by creating a reverse index through the successive merging of smaller indices, a process that is essential for reducing memory requirements.

The reverse index comprises two files: one containing offset pointers and another containing sorted numbers. Index construction starts after each partition finishes crawling and processing, and typically takes around four hours to run.
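As a rough illustration of the two-file layout, the sketch below models the offsets file and the sorted-numbers file as in-memory arrays. The class name `ReverseIndexSketch`, the dense integer keyword IDs, and the convention that `offsets[k]` and `offsets[k + 1]` bracket a keyword's entries are assumptions made for this example, not details from the report.

```java
import java.util.Arrays;

// Hypothetical in-memory model of the two index files; the layout is assumed.
final class ReverseIndexSketch {
    // offsets[k] .. offsets[k + 1] delimits keyword k's slice of `docs`.
    private final long[] offsets;
    // Sorted numbers for every keyword, stored back to back.
    private final long[] docs;

    ReverseIndexSketch(long[] offsets, long[] docs) {
        this.offsets = offsets;
        this.docs = docs;
    }

    /** Returns the sorted numbers recorded for the given keyword ID. */
    long[] lookup(int keywordId) {
        int start = Math.toIntExact(offsets[keywordId]);
        int end = Math.toIntExact(offsets[keywordId + 1]);
        return Arrays.copyOfRange(docs, start, end);
    }

    public static void main(String[] args) {
        ReverseIndexSketch index = new ReverseIndexSketch(
                new long[] {0, 2, 2, 5},
                new long[] {3, 9, 1, 4, 7});
        System.out.println(Arrays.toString(index.lookup(2))); // [1, 4, 7]
    }
}
```

In this toy layout, keyword 1 has no entries because its start and end offsets coincide.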

However, developers encountered a sudden and random failure in the code responsible for merging the indices.

The failure occurred while copying sorted numbers from an older index to a newer one, in cases where a keyword was present in only one of the two indices and therefore required no actual merge.
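The distinction between the merge path and the copy-only path can be sketched as follows. This is a minimal illustration, not the project's code: the class and method names are hypothetical, the per-keyword lists are plain arrays rather than file regions, and dropping a number that appears in both inputs is an assumption.

```java
import java.util.Arrays;

// Hypothetical helper showing the two paths a per-keyword merge can take.
final class PostingMergeSketch {
    /**
     * Combines one keyword's sorted entries from an older and a newer index.
     * If only one side has entries, the result is a plain copy, which is the
     * path the article says was failing.
     */
    static long[] combine(long[] older, long[] newer) {
        if (older.length == 0) return newer.clone(); // copy only, no merge
        if (newer.length == 0) return older.clone(); // copy only, no merge

        long[] merged = new long[older.length + newer.length];
        int i = 0, j = 0, n = 0;
        while (i < older.length && j < newer.length) {
            if (older[i] < newer[j]) {
                merged[n++] = older[i++];
            } else if (older[i] > newer[j]) {
                merged[n++] = newer[j++];
            } else { // same number on both sides: keep one copy (assumed)
                merged[n++] = older[i++];
                j++;
            }
        }
        while (i < older.length) merged[n++] = older[i++];
        while (j < newer.length) merged[n++] = newer[j++];
        return Arrays.copyOf(merged, n);
    }
}
```

For example, `combine(new long[] {1, 4}, new long[] {2, 4, 9})` yields `[1, 2, 4, 9]`, while an empty array on either side simply returns a copy of the other.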


Investigation and Troubleshooting

According to the report, initial suspicions pointed to a 32-bit integer overflow, since index construction works with files in the 1-32 GB range, where such errors are common.
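To show why this was a plausible first suspect, the sketch below sums three 1 GiB sizes into a 32-bit `int` and into a 64-bit `long`. The class and method names are hypothetical and the sizes are made up for the example.

```java
// Hypothetical demonstration of 32-bit wraparound at multi-gigabyte sizes.
final class OverflowSketch {
    /** Sums sizes into a 32-bit int; wraps silently once the total passes 2^31 - 1. */
    static int sumAsInt(int[] sizes) {
        int total = 0;
        for (int size : sizes) total += size;
        return total;
    }

    /** Sums the same sizes into a 64-bit long, which has ample headroom here. */
    static long sumAsLong(int[] sizes) {
        long total = 0;
        for (int size : sizes) total += size;
        return total;
    }

    public static void main(String[] args) {
        int[] sizes = {1 << 30, 1 << 30, 1 << 30}; // three 1 GiB chunks
        System.out.println(sumAsInt(sizes));  // -1073741824
        System.out.println(sumAsLong(sizes)); // 3221225472
    }
}
```

A negative or truncated offset of this kind would produce exactly the sort of out-of-bounds copy described in the article, which is why the team checked for it first.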

Despite thorough code reviews and the addition of guard clauses and assertions, the issue persisted: the copy operation still attempted to copy data outside the bounds of the file.

In a surprising turn of events, the construction process completed successfully during troubleshooting, but the success was short-lived, as the problem recurred on subsequent runs.

The non-deterministic nature of the parallel merging process was thought to be a factor, but this did not fully explain the erratic behavior.

// val:    read-only memory-mapped file, not subject to change
// counts: zeroed memory-mapped file

long offset = 0;
for (int i = 0; i < length; i++) {
  counts[i] = val[i];
  offset += val[i];
}

long size = 0;
for (int i = 0; i < length; i++) {
   size += counts[i];
}

// ...

assert (size == offset);

// ...

truncate(size);
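For readers who want to try the same check, here is a self-contained sketch of the pseudocode above using `FileChannel.map`. It is not the project's code: the class name `CountsCheckSketch`, the temp files, and the small input are assumptions for illustration. Both totals are derived from the same unchanging values, so they should always agree.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical reproduction harness for the size == offset check.
final class CountsCheckSketch {
    /** Copies `values` between two memory-mapped temp files; returns {offset, size}. */
    static long[] copyAndSum(long[] values) {
        try {
            Path valPath = Files.createTempFile("val", ".dat");
            Path countsPath = Files.createTempFile("counts", ".dat");
            valPath.toFile().deleteOnExit();
            countsPath.toFile().deleteOnExit();

            int bytes = values.length * Long.BYTES;
            ByteBuffer src = ByteBuffer.allocate(bytes);
            src.asLongBuffer().put(values);
            Files.write(valPath, src.array());       // source data
            Files.write(countsPath, new byte[bytes]); // zeroed destination

            try (FileChannel valCh = FileChannel.open(valPath, StandardOpenOption.READ);
                 FileChannel countsCh = FileChannel.open(countsPath,
                         StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                LongBuffer val = valCh.map(MapMode.READ_ONLY, 0, bytes).asLongBuffer();
                LongBuffer counts = countsCh.map(MapMode.READ_WRITE, 0, bytes).asLongBuffer();

                long offset = 0;
                for (int i = 0; i < values.length; i++) {
                    counts.put(i, val.get(i));
                    offset += val.get(i);
                }
                long size = 0;
                for (int i = 0; i < values.length; i++) {
                    size += counts.get(i);
                }
                return new long[] {offset, size};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `copyAndSum(new long[] {3, 5, 8})` should return two equal totals. A mismatch between them cannot be explained by the program logic alone, which is the observation that pushed the investigation beyond the application code.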

The team managed to push through the remaining partitions by repeatedly restarting the process, a tedious and time-consuming workaround.

Deep Dive into the Code

Further investigation ruled out integer overflow as the culprit: the code in question used 64-bit longs, and the values involved were not large enough to overflow them. A function that shrinks the merged index was also examined, but disabling it did not resolve the errors.

A breakthrough came when a curious anomaly was discovered in the code: an assertion comparing two sizes calculated from the same data would inexplicably fail. This led to the realization that the problem might lie outside the program logic.

Identifying the Root Cause

The team considered the Java Virtual Machine (JVM), the Linux kernel, and hardware as potential sources of the problem. The JVM was the prime suspect because the project had recently transitioned from OpenJDK to GraalVM.

Hardware issues were deemed unlikely, as they typically do not target a specific function repeatedly. Similarly, the possibility of a Linux kernel bug was discounted after reproducing the error on different machines with varying configurations.

Ultimately, switching the project’s Docker build process from GraalVM to Temurin (OpenJDK) resolved the issue, with the search engine functioning correctly thereafter.

While the bug has been nominally fixed, the exact cause remains elusive, making it difficult to file a detailed bug report. The developer has isolated the code that manifested the bug and conducted extensive testing without encountering the issue again, suggesting an intermittent problem that is difficult to pin down.

The resolution of the bug brings relief to the team, but the inability to understand the underlying cause leaves a sense of an impasse rather than a definitive solution. Despite this, the project can now move forward with a stable search engine index construction process.
