path: root/drivers/misc/habanalabs/gaudi
Commit message (Author, Age, Files, Lines)
* habanalabs/gaudi: use 8KB aligned address for TPC kernels (Tomer Tayar, 2022-09-20, 1 file, -3/+4)
| | | | | | | | | | | | | | I$ prefetch is enabled when sending a TPC kernel to initialize the TPC memory, and it has a restriction that the base address will be aligned to 8KB. Currently the base address is 128 bytes from the start address of the device SRAM, so prefetching will start 128 bytes before the actual kernel memory. Modify the kernel address to be 8KB aligned. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
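The alignment requirement described above can be illustrated with a small sketch. The commit message does not say how the driver picks the new address, so rounding up is an assumption here, and `TPC_KERNEL_ALIGNMENT`, `align_up` and `tpc_kernel_addr` are hypothetical names rather than driver symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: the 8KB alignment named in the commit message. */
#define TPC_KERNEL_ALIGNMENT 0x2000ull

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/*
 * Assumption: the kernel is placed at the first 8KB-aligned address at or
 * after sram_base + offset. The real driver may choose the address differently.
 */
static uint64_t tpc_kernel_addr(uint64_t sram_base, uint64_t offset)
{
	return align_up(sram_base + offset, TPC_KERNEL_ALIGNMENT);
}
```

With an 8KB-aligned SRAM base and the old 128-byte offset, this sketch moves the kernel to the next 8KB boundary, so a prefetch from the aligned base starts at the kernel itself.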
* habanalabs/gaudi: change TPC Assert to use TPC DEC instead of QMAN err (Tal Cohen, 2022-09-19, 1 file, -6/+6)
| | | | | | | | | | This change is needed because using the QMAN error for the asynchronous TPC assert is problematic: a security limitation prevents generating the assert via the QMAN error. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: rename error info structure (Dani Liberman, 2022-09-19, 1 file, -15/+16)
| | | | | | | | | As a preparation for adding more errors to it, change it to a more suitable name. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: MMU invalidation h/w is per device (Oded Gabbay, 2022-09-19, 1 file, -4/+4)
| | | | | | | | | | | | The code used the mmu mutex to protect access to the context's page tables and invalidation of the MMU cache. Because pgt are per context, the mmu mutex was a member of the context object. The problem is that the device has a single MMU invalidation h/w (per MMU). Therefore, the mmu mutex should not be a property of the context but a property of the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: new notifier events for device state (Tal Cohen, 2022-09-19, 1 file, -5/+34)
| | | | | | | | | | Add new notifier events that report several device states. A general H/W error event is raised when a general device H/W error occurs. A user engine error event is raised when a device engine reports an error. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: send device active message to f/w (farah kassabri, 2022-09-19, 1 file, -0/+6)
| | | | | | | | | As part of the RAS that is done by the f/w, we should send a message to the f/w when a user either acquires or releases the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: unify hwmon resources clean up (Dani Liberman, 2022-09-18, 1 file, -17/+1)
| | | | | | | | | Since the hwmon fini code is common to all ASICs, unify it into a common function. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: read div_sel value from firmware (Ohad Sharabi, 2022-09-18, 1 file, -2/+3)
| | | | | | | | | Even when running with unsecured f/w, we should read the PLL div_sel value from the f/w as this register is always privileged. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: fix print format for div_sel (Ohad Sharabi, 2022-09-18, 1 file, -3/+1)
| | | | | | | | Print format was for int (%d) while variable is u32. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: remove all kdma locks (Oded Gabbay, 2022-09-18, 1 file, -2/+0)
| | | | | | | | We don't use KDMA concurrently in the driver. The only use is through debugfs and we don't protect concurrent access through it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: rename non_hard_reset to compute_reset (Ofir Bitton, 2022-09-18, 1 file, -2/+2)
| | | | | | | | | | In order to be more explicit, we should use the term compute_reset to describe the reset in which only the compute engines get reset. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: Fix spelling mistake "Scrubing" -> "Scrubbing" (Colin Ian King, 2022-09-18, 1 file, -1/+1)
| | | | | | | | There is a spelling mistake in a dev_dbg message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: removed seq_file parameter from is_idle asic functions (Dani Liberman, 2022-09-18, 1 file, -23/+24)
| | | | | | | | | | Change the is_idle functions so they are more usable outside debugfs. Do this by replacing the seq_file parameter with a regular string. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: move h/w dirty message to debug (Oded Gabbay, 2022-07-12, 1 file, -2/+1)
| | | | | | | | | H/W being dirty during initialization is completely expected in case f/w tools are used before loading the driver. As it is not an error, and as it doesn't give any meaningful information to the user, no point of printing it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add a value field to hl_fw_send_pci_access_msg() (Tomer Tayar, 2022-07-12, 1 file, -3/+3)
| | | | | | | | | | | For gaudi2 we need to send a value to F/W as part of the PCI_ACCESS packet. As a preparation, modify hl_fw_send_pci_access_msg() to have a 'value' field. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi2: remove unused defines (Oded Gabbay, 2022-07-12, 1 file, -7/+0)
| | | | | | | There were some defines that are unused in the current upstreamed code. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: wait for preboot ready after hard reset (Ohad Sharabi, 2022-07-12, 1 file, -5/+14)
| | | | | | | | | | Currently we are not waiting for preboot ready after hard reset. This leads to a race in which COMMs protocol begins but will get no response from the f/w. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: remove obsolete device variables used for testing (Oded Gabbay, 2022-07-12, 1 file, -130/+1)
| | | | | | | | | | There are a couple of device variables that are used for testing purposes and they are set to fixed values. Remove the variables that are not relevant anymore and document the remaining variables. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: initialize new asic properties (Oded Gabbay, 2022-07-12, 1 file, -6/+12)
| | | | | | | New asic properties were added for Gaudi2. We want to initialize and use them, when relevant, also for Goya and Gaudi. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add unsupported functions (Oded Gabbay, 2022-07-12, 1 file, -0/+21)
| | | | | | | | | | There are a number of new ASIC-specific functions that were added for Gaudi2. To make the common code work, we need to define empty implementations of those functions for Goya and Gaudi. Some functions will return error if called with Goya/Gaudi. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add gaudi2 asic-specific code (Oded Gabbay, 2022-07-12, 1 file, -1/+1)
| | | | | | | | | | | | | | | Add the ASIC-specific code for Gaudi2. Supply (almost) all of the function callbacks that the driver's common code need to initialize, finalize and submit workloads to the Gaudi2 ASIC. It also contains the code to initialize the F/W of the Gaudi2 ASIC and to receive events from the F/W. It contains new debugfs entry to dump razwi events. razwi is a case where the device's engines create a transaction that reaches an invalid destination. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: use %pa to print pci bar size (Oded Gabbay, 2022-07-12, 1 file, -14/+11)
| | | | | | | PCI bar size is resource_size_t so we should use %pa to make it work correctly on all architectures. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: replace hl_poll_timeout with while loop (Dafna Hirschfeld, 2022-07-12, 1 file, -12/+11)
| | | | | | | | | | in gaudi_scrub_device_mem, replace call to hl_poll_timeout with a while loop to avoid using dummy variables. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
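A bounded polling loop of the kind this commit describes might look like the sketch below. It is a user-space illustration only: the poll-count limit, `wait_until_idle`, `struct fake_engine` and `fake_engine_busy` are hypothetical, and the actual loop in gaudi_scrub_device_mem is not shown in this log (it likely uses a time-based timeout rather than a poll count).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback that reports whether the engine is still busy. */
typedef bool (*busy_fn)(void *ctx);

/*
 * Poll is_busy() until it reports idle or max_polls busy results were seen.
 * Returns 0 on idle and -ETIMEDOUT otherwise. No dummy output variable is
 * needed, because the loop condition is the status check itself.
 */
static int wait_until_idle(busy_fn is_busy, void *ctx, unsigned int max_polls)
{
	unsigned int polls = 0;

	assert(is_busy != NULL);
	while (is_busy(ctx)) {
		if (++polls >= max_polls)
			return -ETIMEDOUT;
	}
	return 0;
}

/* Hypothetical stand-in for hardware: busy for a fixed number of polls. */
struct fake_engine {
	unsigned int busy_polls_left;
};

static bool fake_engine_busy(void *ctx)
{
	struct fake_engine *engine = ctx;

	if (engine->busy_polls_left == 0)
		return false;
	engine->busy_polls_left--;
	return true;
}
```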
* habanalabs: communicate supported page sizes to user (Ohad Sharabi, 2022-07-12, 1 file, -7/+0)
| | | | | | | | | | | | Because in future ASICs the driver will allow the user to set the page size we need to make sure this data is propagated in all APIs. In addition, since this is already an ASIC property we no longer need ASIC function for it. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: page size can only be a power of 2 (Ohad Sharabi, 2022-07-12, 1 file, -1/+0)
| | | | | | | | We dropped support for page sizes that are not power of 2. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
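As an illustration of why a power-of-2 restriction simplifies page handling, the sketch below validates a page size and then derives in-page offsets with a mask instead of a division. `page_size_is_valid` and `page_offset` are hypothetical helpers, not functions from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A valid page size is non-zero and has exactly one bit set. */
static bool page_size_is_valid(uint64_t page_size)
{
	return page_size != 0 && (page_size & (page_size - 1)) == 0;
}

/* With a power-of-2 page size, the offset inside a page is a simple mask. */
static uint64_t page_offset(uint64_t addr, uint64_t page_size)
{
	assert(page_size_is_valid(page_size));
	return addr & (page_size - 1);
}
```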
* habanalabs: refactor dma asic-specific functions (Ohad Sharabi, 2022-07-12, 1 file, -59/+30)
| | | | | | | | | | | | | | | | This is a pre-requisite patch for adding tracepoints to the DMA memory operations (allocation/free) in the driver. The main purpose is to be able to cross data with the map operations and determine whether memory violation occurred, for example free DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: remove unused enum (Oded Gabbay, 2022-07-12, 1 file, -22/+9)
| | | | | | Also beautify code by preferring single line wherever possible. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: mask constant value before cast (Oded Gabbay, 2022-07-12, 1 file, -4/+4)
| | | | | | | This fixes a sparse warning of "cast truncates bits from constant value" Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
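The pattern named in the subject line can be sketched as follows. The constant and the 16-bit destination width are made up for illustration; the log does not show which constants the commit touched.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant that is wider than the 16-bit destination. */
#define WIDE_CONSTANT 0x12345u

/*
 * Masking before the cast makes the truncation explicit: the value handed
 * to the cast already fits in 16 bits, so no bits are silently dropped.
 */
#define WIDE_CONSTANT_LOW16 ((uint16_t)(WIDE_CONSTANT & 0xFFFFu))

static_assert(WIDE_CONSTANT_LOW16 == 0x2345u, "mask keeps the low 16 bits");

static uint16_t wide_constant_low16(void)
{
	return WIDE_CONSTANT_LOW16;
}
```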
* habanalabs/gaudi: use correct type in assignment (Oded Gabbay, 2022-07-12, 1 file, -1/+1)
| | | | | | | packets are defined as LE so we need to convert before assigning values to them. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: fix function name in comment (Oded Gabbay, 2022-07-12, 1 file, -1/+1)
| | | | | | function name in comment didn't match actual function name. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: use memory_scrub_val from debugfs (Dafna Hirschfeld, 2022-07-12, 1 file, -3/+2)
| | | | | | | | | In the callback scrub_device_mem, use 'memory_scrub_val' from debugfs for the scrubbing value. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: don't send addr and size to scrub_device_mem cb (Dafna Hirschfeld, 2022-07-12, 1 file, -33/+31)
| | | | | | | | | | We use scrub_device_mem only to scrub the entire SRAM and entire DRAM. Therefore there is no need to send addr and size args to the callback. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: fix a race condition causing DMAR error (Yuri Nudelman, 2022-07-12, 1 file, -12/+34)
    There is a rare race condition in the CB completion mechanism that can occur under very high pressure of command submissions.

    The preconditions for this to happen are:
    1. There should be enough command submissions for the pre-allocated patched CB pool to run out of commands. At this stage we start allocating new patched CBs as they arrive.
    2. CB size has to be exactly (128*n + 104)B for some n, i.e. 24B below a cache line end.

    The flow:
    1. Two command buffers are completed on different streams at the same time. Denote them CB1 and CB2.
    2. Each command buffer is injected with two messages, 16B each: one for a HBW update of the completion queue, another to raise an interrupt.
    3. Assume CB1 updated the completion queue and raised the interrupt.
    4. Assume CB2 updated the completion queue but did not raise the interrupt yet.
    5. The host receives the interrupt. It goes over the completion queue, sees two completions (CB1 and CB2) and releases them both.
    6. CB2 performs the last command. The problem is that the last command is split between 2 cache lines, so to read the last 8B of the last command, it has to access the host again. But CB2 is already released, which causes a DMAR error.

    The solution to this problem is simply to make sure the last two commands in the CB are always in the same cache line, using NOP padding.

    Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
    Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
    Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
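The padding rule in this commit message can be sketched as a size calculation. The 128B cache line and the two 16B messages come from the message itself; the function name and the assumption that the padding is filled with NOP packets whose size divides the padding length are illustrative and not taken from the driver source.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE     128u /* from the commit message */
#define COMPLETION_MSG_SIZE 16u  /* from the commit message */
#define NUM_COMPLETION_MSGS 2u   /* completion queue update + interrupt */

static_assert(NUM_COMPLETION_MSGS * COMPLETION_MSG_SIZE <= CACHE_LINE_SIZE,
	      "both messages must fit in one cache line");

/*
 * Hypothetical helper: number of bytes to append to a user CB so that the
 * two trailing messages never straddle a cache-line boundary. If they
 * would, pad (with NOPs, by assumption) up to the next cache line first.
 */
static uint32_t patched_cb_extra_size(uint32_t user_cb_size)
{
	uint32_t tail = NUM_COMPLETION_MSGS * COMPLETION_MSG_SIZE;
	uint32_t used_in_line = user_cb_size % CACHE_LINE_SIZE;
	uint32_t padding = 0;

	if (used_in_line + tail > CACHE_LINE_SIZE)
		padding = CACHE_LINE_SIZE - used_in_line;

	return padding + tail;
}
```

For the problematic size of (128*n + 104)B, the sketch adds 24B of padding so that both messages start at the next cache-line boundary. A size of (128*n + 96)B needs no padding, since the two messages end exactly at the cache-line end.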
* habanalabs/gaudi: fix warning: var might be used uninitialized (Koby Elbaz, 2022-07-12, 1 file, -1/+1)
| | | | | | | | | | | kernel test robot: "warning: variable 'index' is used uninitialized whenever 'if' condition is false" Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: notify user process on device unavailable (Tal Cohen, 2022-07-12, 1 file, -1/+4)
| | | | | | | | | | | When a device error occurs, user process would like to get some indication on the error by reading some device HW info. If the device is unavailable, user process can't perform any HW device reading. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: remove unused get_dma_desc_list_size (Oded Gabbay, 2022-07-12, 1 file, -1/+0)
| | | | | | | | This asic callback function is not called anymore from the common code. The asic-specific function itself is called but from within the asic-specific code. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: fix shift out of bounds (Ofir Bitton, 2022-07-12, 1 file, -7/+9)
| | | | | | | | | When validating NIC queues, queue offset calculation must be performed only for NIC queues. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: fix incorrect MME offset calculation (Koby Elbaz, 2022-07-12, 1 file, -3/+8)
| | | | | | | | | | | | | | | | | | Once FW raised an event following a MME2 QMAN error, the driver should have gone to the corresponding status registers, trying to gather more info on the error, yet it was accidentally accessing MME1 QMAN address space. Generally, we have x4 MMEs, while 0 & 2 are marked MASTER, and 1 & 3 are marked SLAVE. The former can be addressed, yet addressing the latter is considered an access violation, and will result in a hung system, which is what unintentionally happened above. Note that this cannot happen in a secured system, since these registers are protected with range registers. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: send device reset notification (Tal Cohen, 2022-07-12, 1 file, -3/+10)
| | | | | | | | | | | | A device reset event indicates that the device will be reset after a short delay. In such a case, the driver sends a notification to the user process. This allows the user process to take several debug actions for system diagnostic purposes. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: invoke device reset from one code block (Tal Cohen, 2022-07-12, 1 file, -9/+16)
| | | | | | | | | | | | | | In order to prepare the driver code for device reset event notification, change the event handler function flow to call device reset from one code block. In addition, the commit fixes an issue that reset was performed w/o checking the 'hard_reset_on_fw_event' state and w/o setting the HL_DRV_RESET_DELAY flag. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: collect undefined opcode error info (Tal Cohen, 2022-07-12, 1 file, -21/+87)
| | | | | | | | | | | | | | When an undefined opcode error occurs, the driver collects the relevant information from the QMAN and stores it inside the hdev data structure. An event fd indication is sent to user space. Note: a follow-up commit will add support for reading the error info through an ioctl. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: fix comment to reflect current code (Oded Gabbay, 2022-07-12, 1 file, -2/+8)
| | | | | | | | | | | Due to code changes in the past few years, the original comment of how parser->user_cb_size is checked was not correct anymore. Fix it to reflect current code and add more explanation as the code is more complex now. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: change the write flag name of error info structs (Tal Cohen, 2022-07-12, 1 file, -2/+2)
| | | | | | | | | Positive flag naming makes the code clearer when adding more 'error info' structures. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs/gaudi: move tpc assert raise into internal func (Tal Cohen, 2022-07-12, 1 file, -15/+12)
| | | | | | | | | raising the tpc assert event in an internal function will make the code cleaner as we are going to be adding more events Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
* habanalabs: add terminating NULL to attrs arrays (Dafna Hirschfeld, 2022-07-12, 1 file, -0/+1)
| | | | | | | | | | Arrays of struct attribute are expected to be NULL terminated. This is required by API methods such as device_add_groups. This fixes a crash when loading the driver for Goya device. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
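To show why the terminator matters, here is a user-space sketch of how a consumer walks a NULL-terminated pointer array. `struct attribute_stub`, the attribute names and `count_attrs` are hypothetical stand-ins; the kernel's own `struct attribute` and `device_add_groups` are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct attribute. */
struct attribute_stub {
	const char *name;
};

static struct attribute_stub demo_attr_a = { "demo_attr_a" };
static struct attribute_stub demo_attr_b = { "demo_attr_b" };

/* The trailing NULL is the only way a consumer can find the array's end. */
static struct attribute_stub *demo_attrs[] = {
	&demo_attr_a,
	&demo_attr_b,
	NULL,
};

/*
 * Walks the array until the NULL terminator. Without the terminator this
 * loop would read past the end of the array, which is the kind of
 * out-of-bounds access that can crash a consumer.
 */
static size_t count_attrs(struct attribute_stub *const *attrs)
{
	size_t n = 0;

	assert(attrs != NULL);
	while (attrs[n] != NULL)
		n++;
	return n;
}
```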
* habanalabs: use separate structure info for each error collect data (Tal Cohen, 2022-05-22, 1 file, -8/+7)
| | | | | | | | | | | | Create separate info structure for each error type. The structures shall be used inside the large structure that contains the last session error. This is more scalable for adding more errors in the future. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* habanalabs: do MMU prefetch as deferred work (Ohad Sharabi, 2022-05-22, 1 file, -7/+1)
| | | | | | | | | | | | | When user requests to prefetch the MMU translations, the driver will not block the user until prefetch is done. Instead, the prefetch work will be delegated to a WQ which will do it in the background. This way, the prefetch may progress without blocking the user at all. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* habanalabs: add support for notification via eventfd (Tal Cohen, 2022-05-22, 1 file, -1/+13)
| | | | | | | | | | | | | | | | | | | | The driver will be able to send notification events towards a user process, using user's registered event file descriptor. The driver uses the notification mechanism to inform the user about an occurred event. A user thread can wait until a notification is received from the driver. The driver stores the occurred event until the user reads it, using HL_INFO_GET_EVENTS - new ioctl opcode in the INFO ioctl. Gaudi specific implementation includes sending a notification on a TPC assertion event that is received from f/w. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
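The eventfd mechanics that this notification scheme builds on can be sketched in user space (Linux only). The sketch uses only the standard eventfd(2), write(2) and read(2) calls. The write here merely stands in for the driver signalling the descriptor, and neither the registration step nor the HL_INFO_GET_EVENTS query is shown, since their exact interfaces are not part of this log. `eventfd_roundtrip` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Creates an eventfd, signals it signal_count times in one write (standing
 * in for the notifying side) and then reads the accumulated counter, as a
 * waiting user thread would. Returns 0 on success and -1 on failure.
 */
static int eventfd_roundtrip(uint64_t signal_count, uint64_t *received)
{
	uint64_t value = signal_count;
	int ret = -1;
	int efd;

	assert(received != NULL);
	if (signal_count == 0)
		return -1; /* a zero counter would make the read block */

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	if (write(efd, &value, sizeof(value)) == (ssize_t)sizeof(value) &&
	    read(efd, received, sizeof(*received)) == (ssize_t)sizeof(*received))
		ret = 0;

	close(efd);
	return ret;
}
```

In a real application the read would typically sit in a dedicated thread or behind poll(2), so the process blocks until a notification arrives.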
* habanalabs: add device memory scrub ability through debugfs (Dafna Hirschfeld, 2022-05-22, 1 file, -8/+10)
| | | | | | | | | | | | | | | Add the ability to scrub the device memory with a given value. Add file 'dram_mem_scrub_val' to set the value and a file 'dram_mem_scrub' to scrub the dram. This is very important to help during automated tests, when you want the CI system to randomize the memory before training certain DL topologies. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* habanalabs: use unified memory manager for CB flow (Yuri Nudelman, 2022-05-22, 1 file, -26/+16)
| | | | | | | | | | With the new code required for the flow added, we can now switch to using the new memory manager infrastructure, removing the old code. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>